Preface





Welcome to the UltraSPARC® IIIi Processor User's Manual. This book contains information
about the architecture and programming of the UltraSPARC IIIi processor, one of Sun
Microsystems’ family of SPARC® V9-compliant processors.





Target Audience 


This user's manual is intended mainly for programmers who write software for the
UltraSPARC IIIi processor. It is a repository of information that is useful to
operating system programmers, application software programmers, logic designers,
and third-party vendors who are trying to understand the architecture and operation of the
UltraSPARC IIIi processor. This manual is both a guide and a reference manual for low-level
programming of the processor.





A Brief History of SPARC 


SPARC stands for Scalable Processor ARChitecture, which was first announced in 1987. 
Unlike more traditional processor architectures, SPARC is an open standard freely available 
through license from SPARC International, Inc. Any company that obtains a license can 
manufacture and sell a SPARC-compliant processor. 


By the early 1990s, SPARC processors were available from over a dozen different vendors, 
and over 8,000 SPARC-compliant applications had been certified. 


In 1994, SPARC International, Inc. published The SPARC Architecture Manual, Version 9, 
which defined a powerful 64-bit enhancement to the SPARC architecture. SPARC V9 
provided support for the following: 


64-bit virtual addresses and 64-bit integer data
Fault tolerance
Fast trap handling and context switching 
Big- and little-endian byte orders 


UltraSPARC is the first family of SPARC V9-compliant processors available from Sun 
Microsystems, Inc. 





Prerequisites 


This user’s manual is a companion to The SPARC Architecture Manual, Version 9. The reader 
of this user’s manual should be familiar with the contents of The SPARC Architecture 
Manual, Version 9, which is available from many technical bookstores or directly from its 
copyright holder: 


SPARC International, Inc. 
2242 Camden Ave, Suite #105 
San Jose, CA 95124 

(408) 558-8111 
http://www.sparc.org 





The SPARC Architecture Manual, Version 9 provides a complete description of the
SPARC V9 architecture. Since SPARC V9 is an open architecture, many of the
implementation decisions have been left to the manufacturers of SPARC-compliant 
processors. These “implementation dependencies” are introduced in The SPARC Architecture 
Manual, Version 9. 
User’s Manual Overview


This manual focuses on the UltraSPARC IIIi processor. However, it
sometimes refers to the UltraSPARC III family of processors to indicate that a
certain feature is general. The term “UltraSPARC III family of processors” refers to processors that are
similar to the UltraSPARC IIIi processor.


This manual is divided into multiple sections. These sections are described next. 
Processor Introduction


The processor introduction section describes the high level features of the UltraSPARC IIIi 
processor. This section also discusses how the UltraSPARC IIIi processor is used in a system. 


Architecture and Functions 


This section discusses the details of the UltraSPARC IIIi architecture and the functions of 
various processor units. An entire chapter is devoted to a discussion on the instruction 
execution pipeline. 


Execution Environment 


This section describes the details necessary to understand the execution environment. Various 
topics such as memory models, data formats, registers, and instruction types are discussed. 


Memory and Cache 


This section describes the details of memories and caches. Topics such as memory models, 
memory sub-system, and caches are discussed. 


Supervisor Programming 


Supervisor software controls the processor and the instruction execution environment for 
itself and application programs. Chapters are devoted to interrupt handling and error 
handling. 


Performance Programming 


This section explores how to exploit the high-performance architecture of the
processor, with a focus on performance instrumentation.


Instruction Definitions Appendix 


This section describes, in detail, each instruction for the UltraSPARC IIIi processor.
SPARC V9 Architecture


The SPARC Architecture Manual, Version 9 was used to implement the processor to ensure
SPARC compatibility for user and application programs. The SPARC V9 manual provides
important theoretical information for operating system programmers who write memory
management software, compiler writers who write machine-specific optimizers, and anyone
who writes code to run on all SPARC V9-compatible machines. Printed copies of The
SPARC Architecture Manual, Version 9 are readily available at bookstores or from SPARC
International, Inc.


Software that is intended to be portable across all SPARC V9 processors should adhere to 
The SPARC Architecture Manual, Version 9. 


In this book, the word architecture refers to the machine details that are visible to an
assembly language programmer or to the compiler code generator. It does not necessarily
include details of the implementation that are not visible or easily observable by software.
Where such details are provided, the intent is to enable faster and better programs.
Textual Usage


Fonts 


Fonts are used as follows: 


Italic sans serif font is used for exception and trap names. “The privileged_action
exception...” is an example of how this font is used. It is also used for assembly language
terms, emphasis, book titles, and the first instance of a word that is defined.


Courier font is used for register fields (named bits), instruction fields, and read-only 
register fields. “The rs1 field contains...” is an example of how this font is used. It is also 
used for literals, instruction names, register names, and software examples. 


UPPERCASE items are acronyms, instruction names, or writable register fields. Some 
common acronyms are listed in Acronyms and Definitions. Note: Names of some 
instructions contain both uppercase and lowercase letters. 


Underbar characters join words in register, register field, exception, and trap names. Note: 
Such words can be split across lines at the underbar without an intervening hyphen. “This 
is true whenever the integer_condition_code field...” is an example of how the underbar 
characters are used. 
Notational Conventions


The following notational conventions are used: 


Square brackets, [ ], indicate a numbered register in a register file. For example, r[0] 
translates to register 0. 


Angle brackets, < >, indicate a bit number or colon-separated range of bit numbers within 
a field. “Bits FSR<29:28> and FSR<12> are...” is an example of how the angle brackets 
are used. 


Curly braces, { }, indicate textual substitution. For example, the string
“ASI_PRIMARY{_LITTLE}” expands to “ASI_PRIMARY” and “ASI_PRIMARY_LITTLE.”


If the bar, |, is used with the curly braces, it represents multiple substitutions. For
example, the string “ASI_DMMU_TSB_{8KB|64KB|DIRECT}_PTR_REG” expands to
“ASI_DMMU_TSB_8KB_PTR_REG,” “ASI_DMMU_TSB_64KB_PTR_REG,” and
“ASI_DMMU_TSB_DIRECT_PTR_REG.”

















The || symbol designates concatenation of bit vectors. A comma (,) on the left side of an
assignment separates quantities that are concatenated for the purpose of assignment. For
example, if X, Y, and Z are 1-bit vectors and the 2-bit vector T equals 11₂, then

(X, Y, Z) ← 0 || T

results in X = 0, Y = 1, and Z = 1.


“A mod B” means “A modulus B,” where the calculated value is the remainder when A is
divided by B.
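
The conventions above can be mimicked in a few lines of Python. The following is only an illustrative sketch and not part of the manual's notation: the helper names expand, bits, concat, and split_bits are hypothetical, and bit vectors are modeled as (value, width) pairs with the leftmost field most significant.

```python
import re
from itertools import product


def expand(pattern: str) -> list[str]:
    """Expand {x} (optional text) and {a|b|c} (alternatives) in a name."""
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            choices.append([part])             # literal text outside braces
        elif "|" in part:
            choices.append(part.split("|"))    # one alternative is chosen
        else:
            choices.append(["", part])         # text is absent or present
    return ["".join(combo) for combo in product(*choices)]


def bits(value: int, hi: int, lo: int) -> int:
    """Return field <hi:lo> of value, as in the angle-bracket notation."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def concat(*fields: tuple[int, int]) -> tuple[int, int]:
    """Concatenate (value, width) bit vectors; leftmost is most significant."""
    value, width = 0, 0
    for v, w in fields:
        value = (value << w) | (v & ((1 << w) - 1))
        width += w
    return value, width


def split_bits(value: int, widths: list[int]) -> list[int]:
    """Assign a concatenated vector to fields listed from left to right."""
    out = []
    shift = sum(widths)
    for w in widths:
        shift -= w
        out.append((value >> shift) & ((1 << w) - 1))
    return out


# (X, Y, Z) <- 0 || T, with T = 11 (binary)
vector, _ = concat((0, 1), (0b11, 2))
x, y, z = split_bits(vector, [1, 1, 1])
```

With these helpers, expand("ASI_PRIMARY{_LITTLE}") yields the two ASI names from the curly-brace example, and the last two lines reproduce the concatenation example.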


Notation for Numbers 


Numbers throughout this specification are decimal (base-10) unless otherwise indicated.
Numbers in other bases are followed by a numeric subscript indicating their base (for
example, 1001₂, FFFF 0000₁₆). In some cases, numbers may be preceded by “0x” to indicate
hexadecimal (base-16) notation (for example, 0xFFFF.0000). Long binary and hexadecimal
numbers within the text have spaces or periods inserted every four characters to improve
readability.


The notation 7h’1F indicates the hexadecimal number 1F with a width of 7 binary bits.
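
As a rough illustration of these number notations, the sketch below parses both forms in Python. The function names parse_hex and parse_sized_hex are hypothetical, and the sketch assumes that a sized value must fit within its stated width; the manual itself does not define any such parser.

```python
import re


def parse_hex(text: str) -> int:
    """Parse a 0x-prefixed number, ignoring readability spaces and periods."""
    digits = re.sub(r"[. ]", "", text)
    return int(digits, 16)


def parse_sized_hex(text: str) -> tuple[int, int]:
    """Parse <width>h'<digits>, for example 7h'1F, into (width, value)."""
    m = re.fullmatch(r"(\d+)h['’]([0-9A-Fa-f. ]+)", text)
    if m is None:
        raise ValueError(f"not a sized hexadecimal literal: {text!r}")
    width = int(m.group(1))
    value = int(re.sub(r"[. ]", "", m.group(2)), 16)
    # Assumption: the value is expected to fit in the stated number of bits.
    if value >= (1 << width):
        raise ValueError(f"{text!r} does not fit in {width} bits")
    return width, value


width, value = parse_sized_hex("7h'1F")   # 7-bit field holding 0x1F
mask = parse_hex("0xFFFF.0000")           # periods are separators only
```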


Informational Notes 


This guide provides several different types of information in notes, as follows: 


Programming Note — Programming notes contain incidental information about 
programming the UltraSPARC IIIi processor unless otherwise restricted to a particular
processor in the family. 





Implementation Note — Implementation notes contain implementation-specific
information about the UltraSPARC IIIi processor as compared with other
UltraSPARC processors.





Compatibility Note — Compatibility notes contain information relevant to the previous 
SPARC V8 architecture. 





UltraSPARC Note — UltraSPARC notes highlight the differences between the 
UltraSPARC I and UltraSPARC II processors and the UltraSPARC III family of processors. 
This note shows architectural and functional differences that may be generalized or 
applicable to one particular processor in one of the families. Check the appropriate User’s 
Manual or section in this User’s Manual to determine individual processor functionality as 
needed. 





Note — A general note highlights important and informative details about the processor
architecture or its functional operation. It may be used for purposes not covered by one of the
other notes.
Acronyms and Definitions





This chapter defines concepts and terminology common to all implementations of 
SPARC V9. 


address space identifier See ASI

AFAR Asynchronous Fault Address Register

AFSR Asynchronous Fault Status Register

aliased Two virtual addresses that refer to the same physical address

application program A program executed with the processor in non-privileged mode. Note:
Statements made in this specification regarding application programs may not be applicable to
programs (for example, debuggers) that have access to privileged processor state (for
example, as stored in a memory-image dump).

ASI Address Space Identifier. An 8-bit value that identifies an address space. For each
instruction or data access, the integer unit appends an ASI to the address. See also
implicit ASI.

ASR Ancillary State Register

AX Either the A0 or A1 pipeline

big-endian An addressing convention. Within a multiple-byte integer, the byte with the smallest
address is the most significant; a byte's significance decreases as its address increases.

BLD Block Load

BST Block Store

byte Eight consecutive bits of data

clean window A register window in which all of the registers contain zero, a valid address from the
current address space, or valid data from the current address space.

coherence A set of protocols guaranteeing that all memory accesses are globally visible to all
caches on a shared-memory bus.


completed A memory transaction is completed when an idealized memory has executed the
transaction with respect to all processors. A load is considered completed when no
subsequent memory transaction can affect the value returned by the load. A store is
considered completed when no subsequent load can return the value that was
overwritten by the store.

consistency See coherence

context A set of translations that supports a particular address space. See also Memory
Management Unit (MMU).

copyback The process of copying back a dirty cache line in response to a cache hit while
snooping.

CPI Cycles Per Instruction. The number of clock cycles it takes to execute an instruction.

cross-call An interprocessor call in a multiprocessor system

CSR Control Status Register

current window The block of 24 r registers that is currently in use. The Current Window Pointer (CWP)
register points to the current window.

D-cache Level-1 data memory cache

DCTI Delayed Control Transfer Instruction

DCU Data Cache Unit. Includes controller and Tag and Data RAM arrays

demap To invalidate a mapping in the MMU

deprecated The term applied to an architectural feature (such as an instruction or register) for
which a SPARC V9 implementation provides support only for compatibility with
previous versions of the architecture. Use of a deprecated feature must generate correct
results but may compromise software performance. Deprecated features should not be
used in new SPARC V9 software and may not be supported in future versions of the
architecture.

DFT Designed for Test

DIMM Dual In-line Memory Module. Provides a single or double bank of SDRAM devices
with a data width of 72 bits or 144 bits.

dispatch To send a previously fetched instruction to one or more functional units for execution.
Typically, the instruction is dispatched from a reservation station or other buffer of
instructions waiting to be executed. See also issued.

doublet Two bytes (16 bits) of data

doubleword An aligned octlet. Note: The definition of this term is architecture dependent and may
differ from that used in other processor architectures.

DQM Data input/output Mask. Q stands for either input or output.

ECU External or embedded Cache Unit controller
EMU External Memory Unit. A combination of the ECU and the Memory Control Unit
(MCU).

exception A condition that makes it impossible for the processor to continue executing the
current instruction stream without software intervention. See also trap.

extended word An aligned octlet, nominally containing integer data. Note: The definition of this term
is architecture dependent and may differ from that used in other processor
architectures.

f register A floating-point register. SPARC V9 includes single-, double-, and quad-precision
f registers.

fccN One of the floating-point condition code fields fcc0, fcc1, fcc2, or fcc3.

FFA or FGA or FP1 Floating-Point/Graphics ALU pipeline

FGM or FP0 Floating-Point/Graphics Multiply pipeline

FGU Floating Point and Graphics Unit (FP0 and FP1)

floating-point exception An exception that occurs during the execution of a floating-point operate (FPop)
instruction while the corresponding bit in FSR.TEM is set to one. The exceptions are
unfinished_FPop, unimplemented_FPop, sequence_error, hardware_error,
invalid_fp_register, or IEEE_754_exception.

floating-point IEEE-754 exception A floating-point exception, as specified by IEEE Standard 754-1985. Listed within this
specification as IEEE_754_exception.

floating-point operate (FPop) instructions Instructions that perform floating-point calculations, as defined by the FPop1 and
FPop2 opcodes. FPop instructions do not include FBfcc instructions or loads and
stores between memory and the floating-point unit.

floating-point trap type The specific type of a floating-point exception, encoded in the FSR.ftt field.

floating-point unit A processing unit that contains the floating-point registers and performs floating-point
operations, as defined by this specification.

FPRS Floating Point Register State

FPU Floating-Point Unit

FRF Floating-Point Register File

FSR Floating-Point Status Register

halfword An aligned doublet. Note: The definition of this term is architecture dependent and
may differ from that used in other processor architectures.

HBM Hierarchical Bus Mode
hexlet Sixteen bytes (128 bits) of data

HPE Hardware Prefetch Enable

I-cache Level-1 instruction memory cache

IEU Instruction Execution Unit

IIU Instruction Issue Unit

implementation Hardware or software that conforms to all of the specifications of an instruction set
architecture (ISA).

implementation dependent An aspect of the architecture that can legitimately vary among implementations. In
many cases, the permitted range of variation is specified in the SPARC V9 standard.
When a range is specified, compliant implementations must not deviate from that
range.

implicit ASI The ASI that is supplied by the hardware on all instruction accesses and on data
accesses that do not contain an explicit ASI or a reference to the contents of the ASI
register.

informative appendix An appendix containing information that is useful but not required to create an
implementation that conforms to the SPARC V9 specification. See also normative
appendix.

initiated Synonym: issued

instruction field A bit field within an instruction word

instruction group One or more independent instructions that can be dispatched for simultaneous
execution.

instruction set architecture See ISA

integer unit A processing unit that performs integer and control-flow operations and contains
general-purpose integer registers and processor state registers, as defined by this
specification.

interrupt request A request for service presented to the processor by an external device

ISA Instruction Set Architecture. A set that defines instructions, registers, instruction and
data memory, the effect of executed instructions on the registers and memory, and an
algorithm for controlling instruction execution. It does not define clock cycle times,
cycles per instruction, datapaths, etc.

issued (1) A memory transaction (load, store, or atomic load-store) is “issued” when a
processor has sent the transaction to the memory subsystem and the completion of the
request is out of the processor's control. Synonym: initiated.
(2) An instruction (or sequence of instructions) is said to be issued when released from
the processor's in-order instruction fetch unit. Typically, instructions are issued to a
reservation station or other buffer of instructions waiting to be executed. (Other
conventions for this term exist, but this document attempts to use “issue” consistently
as defined here.) See also dispatched.

IU Integer Unit

L2-cache External or embedded unified, instruction/data, Level-2 memory cache

leaf procedure A procedure that is a leaf in the program's call graph, that is, one that does not call (by
using CALL or JMPL) any other procedures.

little-endian An addressing convention. Within a multiple-byte integer, the byte with the smallest
address is the least significant; a byte's significance increases as its address increases.

load An instruction that reads (but does not write) memory or reads (but does not write)
location(s) in an alternate address space. Load includes loads into integer or
floating-point registers, block loads, Load Quadword Atomic, and alternate address
space variants of those instructions. See also load-store and store, the definitions of
which are mutually exclusive with load.

load-store An instruction that explicitly both reads and writes memory or explicitly reads and
writes location(s) in an alternate address space. Load-store includes instructions such
as CASA, CASXA, LDSTUB, and the deprecated SWAP instruction. See also load and
store, the definitions of which are mutually exclusive with load-store.

may A keyword indicating flexibility of choice with no implied preference. Note: “May”
indicates that an action or operation is allowed; “can” indicates that it is possible.

MCU Memory Control Unit. Controls the SDRAM signals

Memory Management Unit See MMU

MMU Memory Management Unit. The address translation hardware in the UltraSPARC IIIi
implementation that translates 64-bit virtual addresses into physical addresses. The
MMU is composed of the TLBs, ASRs, and ASI registers used to manage address
translation. See also context, physical address, and virtual address.

module A master or slave device that attaches to the shared-memory bus

MOESI A cache-coherence protocol. Each of the letters stands for one of the states that a cache
line can be in, as follows: M, modified, dirty data with no outstanding shared copy; O,
owned, dirty data with outstanding shared copy(s); E, exclusive, clean data with no
outstanding shared copy; S, shared, clean data with outstanding shared copy(s); I,
invalid, invalid data.

must Synonym: shall

NaN Not a Number

NCPQ Noncoherent Pending Queue

next program counter See nPC
NFO Nonfault access only

non-faulting load A load operation that, in the absence of faults or in the presence of a recoverable fault,
completes correctly, and in the presence of a nonrecoverable fault returns (with the
assistance of system software) a known data value (nominally zero). See also
speculative load.

non-privileged An adjective that describes:
(1) the state of the processor when PSTATE.PRIV = 0, that is, non-privileged mode;
(2) processor state information that is accessible to software while the processor is in
either privileged mode or non-privileged mode; for example, non-privileged registers,
non-privileged ASRs, or, in general, non-privileged state;
(3) an instruction that can be executed when the processor is in either privileged mode
or non-privileged mode.

non-privileged mode The mode in which a processor is operating when PSTATE.PRIV = 0. See also
privileged.

normative appendix An appendix containing specifications that must be met by an implementation
conforming to the SPARC V9 specification. See also informative appendix.

nPC Next program counter. A register that contains the address of the next executed
instruction if a trap does not occur.

NPT Non-Privileged Trap

NWINDOWS The number of register windows present in a particular implementation

OBP OpenBoot™ PROM

octlet Eight bytes (64 bits) of data. Not to be confused with “octet,” which has been
commonly used to describe eight bits of data. In this document, the term byte, rather
than octet, is used to describe eight bits of data.

opcode A bit pattern that identifies a particular instruction

optional A feature not required for SPARC V9 compliance

ORQ Outgoing Request Queue

PA Physical Address. An address that maps real physical memory or I/O device space. See
also virtual address.

Page Table Entry See PTE

PC Program Counter. A register that contains the address of the instruction currently being
executed by the IU.

PCR Performance Control Register

physical address See PA

PIC Performance Instrumentation Counter
PIO Programmed I/O

PIPT Physically Indexed, Physically Tagged

PIVT Physically Indexed, Virtually Tagged

POR Power-on Reset. The most aggressive reset.

prefetchable (1) An attribute of a memory location that indicates to an MMU that PREFETCH
operations to that location may be applied.
(2) A memory location condition for which the system designer has determined that no
undesirable effects will occur if a PREFETCH operation to that location is allowed to
succeed. Typically, normal memory is prefetchable.
Non-prefetchable locations include those that, when read, change state or cause
external events to occur. For example, some I/O devices are designed with registers
that clear on read; others have registers that initiate operations when read. See also side
effect.

privileged An adjective that describes:
(1) the state of the processor when PSTATE.PRIV = 1, that is, privileged mode;
(2) processor state that is only accessible to software while the processor is in
privileged mode; for example, privileged registers, privileged ASRs, or, in general,
privileged state;
(3) an instruction that can be executed only when the processor is in privileged mode.

privileged mode The mode in which a processor is operating when PSTATE.PRIV = 1. See also
non-privileged.

processor The combination of the integer unit and the floating-point unit

program counter See PC.

PSO Partial Store Order

PTA Pending Tag Array

PTE Page Table Entry. Describes the virtual-to-physical translation and page attributes for a
specific page. A PTE generally means an entry in the page table or in the TLB;
however, it is sometimes used as an entry in the translation storage buffer (TSB). In
general, a PTE contains fewer fields than a TTE. See also TLB and TSB.

QNaN Quiet Not a Number

quadlet Four bytes (32 bits) of data

quadword Aligned hexlet. Note: The definition of this term is architecture dependent and may be
different from that used in other processor architectures.

r register An integer register. Also called a general-purpose register or working register.

RD Rounding Direction

RDPR Read Privileged Register
RED state Reset, Error, and Debug state. The processor state when PSTATE.RED = 1. A
restricted execution environment used to process resets and traps that occur when
TL = MAXTL - 1.

reserved Describes an instruction field, certain bit combinations within an instruction field, or a
register field that is reserved for definition by future versions of the architecture.
Reserved instruction fields shall read as zero, unless the implementation supports
extended instructions within the field. The behavior of SPARC V9 processors
when they encounter nonzero values in reserved instruction fields is undefined.
Reserved bit combinations within instruction fields are defined in Appendix A,
Instruction Definitions. In all cases, SPARC V9 processors shall decode and trap on
these reserved combinations.
Reserved register fields should always be written by software with values of those
fields previously read from that register or with zeroes; they should read as zero in
hardware. Software intended to run on future versions of SPARC V9 should not
assume that these fields will read as zero or any other particular value. Throughout
this specification, figures and tables illustrating registers and instruction encodings
indicate reserved fields and combinations with an em dash (—).

reset trap A vectored transfer of control to privileged software through a fixed-address reset trap
table. Reset traps cause entry into RED state.

restricted Describes an ASI that may be accessed only while the processor is operating in
privileged mode.

RMO Relaxed Memory Order

rs1, rs2, rd The integer or floating-point register operands of an instruction. The source registers
are rs1 and rs2; the destination register is rd.

RTO Read to Own

RTOR Read to Own Remote. A reissued RTO transaction.

RTS Read to Share

RTSM Read to Share Mtag. An RTS to modify MTag transaction.

SAM SPARC Architecture Manual, Version 9

scrub Writes data from the W-cache to the L2-cache

SDRAM Synchronous Dynamic Random Access Memory. May be prefaced with DDR, double
data rate SDRAM.

SFAR Synchronous Fault Address Register

SFSR Synchronous Fault Status Register

shall A keyword indicating a mandatory requirement. Designers shall implement all such
mandatory requirements to ensure interoperability with other SPARC V9-compliant
products. Synonym: must.
should A keyword indicating flexibility of choice with a strongly preferred implementation.
Synonym: it is recommended

SIAM Set Interval Arithmetic Mode instruction

side effect The result of a memory location having additional actions beyond the reading or
writing of data. A side effect can occur when a memory operation on that location is
allowed to succeed. Locations with side effects include those that, when accessed,
change state or cause external events to occur. For example, some I/O devices contain
registers that clear on read; others have registers that initiate operations when read. See
also prefetchable.

SIG Single-Instruction Group. Sometimes shortened to “single-group.”

SIR Software-Initiated Reset

SNaN Signalling Not a Number

snooping The process of maintaining coherency between caches in a shared-memory bus
architecture. All cache controllers monitor (snoop) the bus to determine whether they
have a copy of the shared cache block.

SPE Software Prefetch Enable

speculative load A load operation that is issued by the processor speculatively, that is, before it is
known whether the load will be executed in the flow of the program. Speculative
accesses are used by hardware to speed program execution and are transparent to code.
An implementation, through a combination of hardware and system software, must
nullify speculative loads on memory locations that have side effects; otherwise, such
accesses produce unpredictable results. Contrast with non-faulting load, which is an
explicit load that always completes, even in the presence of recoverable faults.

store An instruction that writes (but does not explicitly read) memory or writes (but does not
explicitly read) location(s) in an alternate address space. Store includes stores from
either integer or floating-point registers, block stores, partial store, and alternate
address space variants of those instructions. See also load and load-store, the
definitions of which are mutually exclusive with store.

superscalar An implementation that allows several instructions to be issued, executed, and
committed in one clock cycle.

supervisor software Software that executes when the processor is in privileged mode

TBA Trap Base Address

TLB Translation Lookaside Buffer. A cache within an MMU that contains recent partial
translations. TLBs speed up closely following translations by often eliminating the
need to reread PTE from memory.

TLB hit The desired translation is present in the on-chip TLB

TLB miss The desired translation is not present in the on-chip TLB


TPC Trap-saved PC

Translation Lookaside Buffer See TLB

trap The action taken by the processor when it changes the instruction flow in response to
the presence of an exception, a Tcc instruction, or an interrupt. The action is a
vectored transfer of control to supervisor software through a table, the address of
which is specified by the privileged TBA register. See also exception.

TSB Translation Storage Buffer. A table of the address translations that is maintained by
software in system memory and that serves as a cache of the address translations.

TSO Total Store Order

TTE Translation Table Entry. Describes the virtual-to-physical translation and page
attributes for a specific page in the Page Table. In some cases, the term is explicitly
used for the entries in the TSB.

UE User process error

unassigned A value (for example, an ASI number), the semantics of which are not architecturally
mandated and which may be determined independently by each implementation within
any given guidelines.

undefined An aspect of the architecture deliberately left unspecified. Software should have no
expectation of, nor make any assumptions about, an undefined feature or behavior. Use
of such a feature can deliver unexpected results, may or may not cause a trap, can vary
among implementations, and can vary with time on a given implementation.
Notwithstanding any of the above, undefined aspects of the architecture shall not cause
security holes (such as allowing user software to access privileged state), or put the
processor into supervisor mode or an unrecoverable state.

unimplemented An architectural feature that is not directly executed in hardware because it is optional
or emulated in software.

unpredictable Synonym: undefined

unrestricted Describes an ASI that can be used regardless of the processor mode; that is, regardless
of the value of PSTATE.PRIV.

user application program Synonym: application program

VA Virtual address. An address produced by a processor that maps all systemwide,
program-visible memory. Virtual addresses usually are translated by a combination of
hardware and software to physical addresses, which can be used to access physical
memory.

victimize [Error handling]

VIPT Virtually Indexed, Physically Tagged

virtual address See VA


VIS Visual Instruction Set. Performs partitioned integer arithmetic and other small integer 
operations. 


VIVT Virtually Indexed, Virtually Tagged (cache) 
WAW Write After Write 
WDR  WatchDog trap-level Reset 


word An aligned quadlet. Note: The definition of this term is architecture dependent and 
may differ from that used in other processor architectures. 


WRF Working Register File 
writeback The process of writing a dirty cache line back to memory before it is refilled. 
WRPR Write Privileged Register 


XIR Externally Initiated Reset 


CHAPTER 1


Introducing the UltraSPARC IIIi
Processor


1.1 Overview


The UltraSPARC IIIi processor is derived from Sun Microsystems' high-end UltraSPARC III
processor, providing many of the same performance, reliability, and security features, but in 
a highly integrated format that brings the power of the UltraSPARC architecture to cost- 
efficient high-end desktop systems and inexpensive 1-4 way servers. It implements both the 
full 64-bit, SPARC V9 architecture and version 2.0 of Sun Microsystems’ VIS™ instruction 
set. The VIS instruction set provides a wide range of “Single Instruction, Multiple Data” 
(SIMD) acceleration functions for working with 8-, 16-, and 32-bit data values, doing pixel 
manipulation, 2D image processing, 3D graphics, data compression, and other specialized 
performance-critical operations. 


Major functional blocks included in the UltraSPARC IIIi processor are:


Integer execution unit

Floating-point execution unit

32 KB primary (Level 1 or L1) instruction cache

64 KB primary (L1) data cache 

1 MB L2 unified cache (used for both instructions and data) 
2 KB prefetch cache for floating-point data 

2 KB write cache 

Synchronous DRAM (SDRAM) memory controller 

JBUS controller 


In common with all other members of the UltraSPARC III family of processors, the 
UltraSPARC IIIi processor is a 4-way superscalar processor, meaning it attempts to fetch 4 
instructions at a time from the L1 instruction cache, and (given the appropriate instruction 
mix) is capable of sustaining an execution rate of 4 instructions per clock cycle. Each 
instruction is processed through a 14-stage pipeline that starts with address generation and
ends with the final retirement of any valid execution result. A 16-entry instruction queue
decouples instruction fetch from instruction issue, working to buffer any discrepancies 
between these two rates. Thus, if more instructions are fetched than can be issued repeatedly, 
an empty instruction queue gradually will fill. Or, if the next instruction fetch misses in the 
L1 cache, a filled instruction queue can hide this break in the flow of instructions through the 
pipeline, by continuing to supply the execution units with instructions for the several clock 
cycles needed to retrieve the missing block of instructions from the on-chip L2 cache. 


To enhance throughput, while instructions enter and exit the instruction queue in strict 
program order, they can complete executing out-of-order. For example, if a short latency 
instruction (like an integer add) follows a long latency instruction (like an integer divide) in 
the pipeline, the fast operation does not need to wait on the slow one to finish. Instructions 
fetched together will enter the queue in parallel, but, within the constraints imposed by 
program order, they may exit the queue in company with instructions fetched either earlier or 
later (depending on the specific instruction mix and availability of the necessary functional 
units). 


The UltraSPARC IIIi processor is supported by Sun's popular Solaris™ operating system,
providing access to the more than ten thousand applications that have been developed for the 
SPARC/Solaris platform over the years. Comprehensive sets of programs are available for 
many fields, including engineering, manufacturing, telecommunications, financial services, 
health, retail, ecommerce, and a variety of other industry segments. Additional operating 
systems available for use with UltraSPARC processors include Linux and leading real-time 
operating systems. A robust set of tools for developing software also can be readily acquired, 
either from Sun Microsystems or independent software vendors.


Features


The UltraSPARC IIIi processor is richly featured, providing all of the following capabilities:


Binary compatibility with the entire base of SPARC application code.


Full 64-bit virtual address space.


64-bit internal operation, including 64-bit datapaths, 64-bit ALUs, and 64-bit address 
arithmetic. 


43-bit physical address space, supporting up to 8 Terabytes of memory. 


Low latency and high bandwidth for memory operations, due in part to a memory 
hierarchy that incorporates separate on-chip L1 instruction and data caches, a 1 MB on- 
chip unified L2 cache, a prefetch cache, a write cache, and an on-chip SDRAM controller. 


1 to 4-way glueless multiprocessing. 


Introductory frequency above 1 GHz, scaling up over time, propelled by a 14-stage non- 
stalling pipeline. 


4-way superscalar instruction dispatch to nine separate execution units.


High-performance JBUS system interface.


Sophisticated power management.


Extensive RAS protection, starting with error detection and correction (EDC) on the 
primary and secondary caches. 


Compared to the previous generation UltraSPARC IIi processor, the UltraSPARC IIIi
processor offers several useful new features, including version 2.0 of the VIS instruction set, 
support for interval arithmetic, better prefetch capabilities, an extended interrupt scheme, and 
4 times as much physical address space. It combines these advantages with far greater levels 
of performance as well as greatly improved data reliability. 
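
The address-space figures quoted in this section follow from powers of two. The short Python
sketch below is illustrative only; the helper name is invented for the example and is not
defined by the processor or by this manual. It checks that 43 physical address bits reach
8 terabytes, and that a fourfold increase corresponds to 2 additional address bits.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses reachable with the given address width."""
    return 1 << address_bits

TERABYTE = 1 << 40  # 2**40 bytes, consistent with "43 bits = 8 Terabytes" in the text

physical_bits = 43
print(addressable_bytes(physical_bits) // TERABYTE)  # size of the physical address space in TB

# "4 times as much physical address space" is a factor of 2**2, so the
# comparison point is an address that is 2 bits narrower.
ratio = addressable_bytes(physical_bits) // addressable_bytes(physical_bits - 2)
print(ratio)
```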


The UltraSPARC IIIi processor brings all the advantages of full 64-bit computing to both
desktop systems and entry-level servers, together with up to 4-way glueless MP operation, in
a very cost-competitive form.


Summary


Detailed information about specific functional blocks and features of the UltraSPARC IIIi 
processor can be found in the following chapters of this manual. This section attempts to 
summarize the more significant elements of the UltraSPARC IIIi processor, for the benefit of 
readers seeking to quickly acquire a relatively comprehensive understanding of it.


Register Windows 

In addition to the usual assortment of registers used for control purposes, status information, 
condition codes, etc., the UltraSPARC architecture includes 160 64-bit integer registers, and 
another set of 32 64-bit registers for use by the FPU and VIS instructions. The 160 integer 
registers are organized into 8 overlapping register “windows” of 32 registers each. In each
register window, 8 registers are shared with the previous window, and are used to hold input 
parameters from a calling routine; 8 registers are shared with the next window, and are used 
to hold output parameters for use by a called routine; 8 registers are unshared, and are used 
to hold local parameters; while 8 registers are global, and are used to hold values shared by 
all routines. The 8 output registers for one window are the 8 input registers for the next 
window. There are four sets of 8 global registers, designated for different uses, as 
appropriate: normal, MMU, interrupt, and alternate. (8 x 8 in/out registers + 8 x 8 local
registers + 4 x 8 global registers = 160 integer registers.) Register windows are a distinctive
feature of the SPARC architecture, designed to provide a very fast means to handle context 
switches, interrupts, and traps. 
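
The register count and the in/out overlap described above can be checked with a small model.
The sketch below is a minimal illustration in Python, not a description of the hardware: the
flat array layout and the function name physical_index are assumptions made for the example.
Only the group sizes and the rule that one window's output registers are the next window's
input registers are taken from the text.

```python
NWINDOWS = 8          # register windows, per the text above
REGS_PER_GROUP = 8    # size of each in, local, out, and global group
GLOBAL_SETS = 4       # normal, MMU, interrupt, alternate

# Each window contributes its own 8 "in" and 8 "local" registers; its "out"
# registers are physically the "in" registers of the next window.
WINDOWED_REGS = NWINDOWS * 2 * REGS_PER_GROUP
TOTAL_INTEGER_REGS = WINDOWED_REGS + GLOBAL_SETS * REGS_PER_GROUP

def physical_index(window, group, n):
    """Map (window, group, register number 0-7) to a slot in a hypothetical
    flat array of windowed registers. The layout is assumed for illustration."""
    if group == "in":
        return (window % NWINDOWS) * 16 + n
    if group == "local":
        return (window % NWINDOWS) * 16 + 8 + n
    if group == "out":
        # outs of window w alias the ins of window w + 1 (wrapping around)
        return ((window + 1) % NWINDOWS) * 16 + n
    raise ValueError("group must be 'in', 'local', or 'out'")

print(TOTAL_INTEGER_REGS)
```

In this model a called routine that moves to the next window finds its caller's output
registers already in place as its own input registers, so no parameter copying is needed.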


32 KB Primary Instruction Cache Memory (4-way set associative) 

Holds 8K fixed-width 4-byte SPARC instructions for immediate access by the pipeline. 
Instructions in this cache are protected against single bit errors by parity checking. If an error 
is detected, the cache line with the erring byte is marked as invalid; as a consequence, the 
next access to that line forces it to be refilled with valid instructions from the L2 cache. 
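
Parity checking detects any single-bit error but cannot correct it, which is why the line is
invalidated and refilled rather than repaired. The Python sketch below illustrates the
principle only. The granularity at which the processor computes parity is not stated here, so
the 4-byte unit, the even-parity convention, and the function names are assumptions.

```python
def parity_bit(data: bytes) -> int:
    """Even-parity bit: 1 if the data contains an odd number of 1 bits."""
    ones = sum(bin(b).count("1") for b in data)
    return ones & 1

def check(data: bytes, stored_parity: int) -> bool:
    """Return True if the data still matches the parity recorded earlier."""
    return parity_bit(data) == stored_parity

word = bytes([0x9D, 0xE3, 0xBF, 0x40])       # arbitrary 4-byte value
p = parity_bit(word)

single = bytes([word[0] ^ 0x01]) + word[1:]  # one bit flipped
double = bytes([word[0] ^ 0x03]) + word[1:]  # two bits flipped

# A single-bit error changes the parity and is caught; a double-bit error
# leaves the parity unchanged and goes unnoticed by this scheme.
print(check(word, p), check(single, p), check(double, p))
```
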
64 KB Primary Data Cache Memory (4-way set associative)

Holds data items for rapid loads to and stores from the register file. (In common with other 
RISC architectures, all SPARC instructions operate register-to-register, accessing their 
operands from the register file and returning their results to it.) Uses the same parity checking/
line invalidation scheme for EDC as the instruction cache. Cache is write-through, so data in 
the primary cache is always “clean.” 


2 KB Prefetch Cache Memory (4-way set associative) 

A special cache used to hold floating-point data that can be fetched well ahead of use. This 
cache increases the effective size of the primary data cache when executing floating-point 
programs, and provides specific hardware support for speculative loads, including both 
software and hardware data prefetch operations. 


2 KB Write Cache Memory (4-way set associative) 

A special cache used to coalesce data being stored back to memory. By reducing the number
of separate store operations needed, it effectively increases the memory bandwidth of the
processor.


Non-cacheable Store Compression 

The UltraSPARC IIIi processor uses a 16-byte buffer to merge adjacent non-cacheable stores
into a single external data transaction, greatly increasing store bandwidth to the graphics 
frame buffer. In addition, a flow control signal is available through the Graphics Status 
Register that allows software to interrogate a FIFO status signal on the graphics card, without 
requiring completion of a non-cacheable read to the device. This prevents stalling due to 
waiting for prior non-cacheable stores to be pushed to the device, and eliminates bubbles in 
the store throughput due to the pipeline depth between the processor and the graphics device. 


1 MB Unified Secondary Cache (4-way set associative) 

This large, on-chip L2 cache buffers the impact of L1 cache misses by providing fast, local 
access to a much larger pool of instructions and data than will fit into the several L1 caches. 
The effect is to substantially reduce the overall latency of memory operations. The tags for 
the L2 cache are protected by parity checking, while data in the cache is protected by full 
ECC, providing single-bit error correction and double-bit error detection. The L2 cache uses 
a write-back policy to reduce store traffic to main memory. Any uncorrectable double-bit 
errors are marked on write-back, so they will not propagate to other processors in an MP 
configuration. 


JBUS Interface 

A Sun-proprietary system interface new to the UltraSPARC IIIi processor, developed to
provide a combination of the high performance expected of Sun systems with the low cost
demanded by the desktop and entry-level server marketplaces. A companion JIO chip is
available from Sun Microsystems. In addition to supporting the shared address/data JBUS
itself, the companion chip also provides support for up to 2 industry-standard PCI buses, as
well as for Sun’s proprietary UPA64S graphics bus (in place of the secondary PCI bus).


SDRAM Controller

Provides direct connectivity of the processor to main memory through a 2-channel DDR 
SDRAM interface. Full ECC protection is provided on all stored memory data, and 
transactions on the memory/address bus are protected by parity checking. In the interests of 
simplicity, any system or DRAM-related, non-correctable errors are handled as deferred 
traps. 


Low Power Operating Modes 

The UltraSPARC IIIi processor features low-power modes. When signalled to conserve
power, the on-chip Clock Control Unit instantaneously switches the processor’s clock rate to
lower power modes.


CHAPTER 2





UltraSPARC IIIi Processor in a System 





The UltraSPARC IIIi processor can reside either on the system motherboard itself or in a
separate module attached to the motherboard. The UltraSPARC IIIi processor is intended to
operate with a special support bridge chip that provides I/O functions (called “JIO”). The
UltraSPARC IIIi processor and its companion I/O chip can be used to scale systems from a
minimum 1-way desktop or blade configuration up to a 4-way stand-alone server. 





2.1 System Configurations


The UltraSPARC IIIi processor is designed to operate efficiently in 1-way, 2-way, or 4-way
systems.


2.1.1 Four-Processor System


FIGURE 2-1 illustrates a typical configuration for a high-performance, 4-way, entry-level 
server. This system incorporates 4 UltraSPARC IIIi processors and two companion JIO chips
(configured as master-slave) to provide maximum I/O bandwidth. In the system shown, JBUS
uses a “Bell Repeater”, a bit-sliced pipeline register chip to reduce loading on JBUS. A lower
cost 4-way system with half the bandwidth can be built using a single master JIO chip.


FIGURE 2-1 Four-Processor System with the UltraSPARC IIIi Processor


Note that, in the configuration shown, four possible JBUS segments, JBUS #0, JBUS #1, 
JBUS #2 and an optional JBUS #3, propagate through the Bell Repeater. The Bell Repeater 
is only needed when the JBUS is required to run at maximum frequency with more than three 
loads, to reduce loading on the JBUS. The Bell Repeater forwards the signals from each of 
the four segments of the JBUS on to the other three segments. Propagating JBUS signals 
through the Bell Repeater introduces a one cycle delay, i.e., any signals the Bell Repeater 
receives in one cycle, it forwards in the next. The Bell Repeater operates entirely
automatically, i.e., it requires no control signals.


2.1.2 Two-Processor System


FIGURE 2-2 illustrates a typical configuration for inexpensive 2-way desktops or servers based 
on the UltraSPARC IIIi processor. This system incorporates 2 UltraSPARC IIIi processors
with two companion JIO chips. Since this configuration, like the 4-way system, may involve
placing 4 loads on the JBUS, it also requires addition of a Bell Repeater to achieve maximum
JBUS performance. In the 4-load configuration shown, however, no Bell Repeater is needed,
since the JBUS in this example has been designed to run lower than maximum frequency.


FIGURE 2-2 Two-Processor System with the UltraSPARC IIIi Processor


2.1.3 One-Processor System


FIGURE 2-3 illustrates a typical configuration for a minimum-cost, 1-way system based on the
UltraSPARC IIIi processor. This system involves no Bell Repeater and only 1 JIO chip. To
reduce cost still further, note that the UltraSPARC IIIi processor can be configured to use a
minimum memory of only two DIMMs on the DDR interface. In this sort of cost-optimized
single-processor configuration, PCI slots are only provided where PCI devices can be added
to a system.


FIGURE 2-3 One-Processor System with the UltraSPARC IIIi Processor


2.2 JBUS Interface


The UltraSPARC IIIi processor has a companion JIO chip that features a 183-pin interface to
connect to the JBUS. The JBUS is a 16-byte (128-bit), split transaction, shared address/data
bus.


2.3 Memory System


The memory system consists of the Memory Control Unit (MCU) in the processor, and two 
channels of DDR Synchronous DRAM memory. Each channel supports either one or two 
registered DIMMs, allowing systems to be configured with less memory (for lower cost) or 
more memory (for higher performance). Each channel has an address/control bus as well as
an 8-byte data bus (plus 1 byte for ECC check bits). Clock buffering with a PLL is provided 
on the DIMMs. 


Since both memory channels are controlled identically by the memory controller, DIMMs 
always must be loaded in pairs. Each DIMM pair consists of two 72-bit DDR SDRAM 
DIMMs. Since each DIMM can be single- or double-sided, there is a maximum of four
data loads per memory channel.


The UltraSPARC IIIi processor modules have a total of four DIMM slots. In order, these are
termed 1A, 1B, 2A, 2B. DIMMs 1A and 2A correspond to memory channel 1. DIMMs 1B
and 2B correspond to memory channel 2. DIMM pair #1 contains DIMMs 1A and 1B.
DIMM pair #2 contains DIMMs 2A and 2B. FIGURE 2-4 summarizes the high level
architecture of the UltraSPARC IIIi memory system, including placement of the four
DIMMs.


Each cache line is split across the DIMMs in memory channel 1 and memory channel 2. In 
FIGURE 2-4, DIMM 1A belongs to memory channel 1 and DIMM 1B belongs to memory
channel 2. Similarly, DIMM 2A belongs to memory channel 1 and DIMM 2B belongs to 
memory channel 2. 


In exactly the same way, each External Bank of memory is split across the two memory 
channels. As shown in FIGURE 2-4, External Banks 0 and 1 are split across DIMM 1A and 
DIMM 1B, and External Banks 2 and 3 are split across DIMM 2A and DIMM 2B. 


Each External Bank contains four Internal Banks. The memory controller pipelines requests 
to memory, making use of all 16 of the internal memory banks available (4 External Banks 
times 4 Internal Banks each), when all DIMM slots are fully loaded. 
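
The slot naming and bank arithmetic described in this section can be summarized in a few
lines of Python. This is a sketch for orientation only: the function names are invented for
the example, and the mapping simply restates the text (the digit of a slot name is the DIMM
pair, the letter selects the memory channel).

```python
DIMM_SLOTS = ("1A", "1B", "2A", "2B")

def dimm_location(slot):
    """Return (pair number, memory channel) for a DIMM slot name.
    The digit is the DIMM pair; suffix A is channel 1 and B is channel 2."""
    if slot not in DIMM_SLOTS:
        raise ValueError(f"unknown DIMM slot: {slot}")
    pair = int(slot[0])
    channel = 1 if slot[1] == "A" else 2
    return pair, channel

def external_banks(slot):
    """External Banks partly held on this DIMM (pair 1: banks 0-1, pair 2: banks 2-3)."""
    pair, _ = dimm_location(slot)
    first = (pair - 1) * 2
    return (first, first + 1)

EXTERNAL_BANKS = 4
INTERNAL_BANKS_PER_EXTERNAL = 4
total_internal_banks = EXTERNAL_BANKS * INTERNAL_BANKS_PER_EXTERNAL

DATA_BYTES_PER_CHANNEL = 8   # data bus width per channel
ECC_BYTES_PER_CHANNEL = 1    # ECC check bits per channel
dimm_width_bits = (DATA_BYTES_PER_CHANNEL + ECC_BYTES_PER_CHANNEL) * 8

print(dimm_location("2B"), external_banks("2B"), total_internal_banks, dimm_width_bits)
```
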
FIGURE 2-4 DDR Memory System Architecture





2.4 Power Management 


The UltraSPARC IIIi processor features two low power modes: a 1/2 speed mode and a 1/32
speed mode for clock operation.


CHAPTER 3





UltraSPARC IIIi Processor Architecture 
Basics 


The UltraSPARC IIIi processor is a high-performance, highly-integrated, 4-way superscalar
processor. In addition to wide parallel instruction dispatch to exploit instruction-level 
parallelism in code, the processor is designed to offer high clock speeds. To reduce 
instruction execution latencies, the processor incorporates on-chip level-1 instruction and 
data caches, a 1 MB unified level-2 cache, a memory controller, and large, flexible memory 
management units (MMUs). The processor was designed specifically to work in inexpensive 
desktop systems and entry-level servers, in configurations ranging from 1-4 processors. 


The UltraSPARC IIIi processor also offers a number of performance enhancements over
previous UltraSPARC processors. The processor incorporates multiple data prefetching
mechanisms to enable long latency load operations to be overlapped with earlier operations.
The processor offers an enhanced data memory management unit (D-MMU) with 3 separate
TLBs providing a total of 1040 entries, and flexible support for page sizes ranging from
8 KB up to 4 MB, enabling the processor to effectively map both small and large memory
systems.





3.1 Component Overview


The processor includes a high-performance, instruction fetch engine, called the instruction 
issue unit, that is decoupled from the rest of the pipeline by a 16-entry instruction buffer. 
Four instructions at a time are fetched from the level-1 instruction cache and queued for issue 
in the instruction buffer. Up to 4 instructions in a clock cycle can be steered from this queue 
into 6 execution buffers. Up to 6 instructions in a clock cycle can be dispatched from the 6 
execution buffers into the 6 parallel execution units in UltraSPARC IIIi processor: 2 integer 
ALUs, 1 branch unit, 1 load/store unit (also handles certain special operations, like integer
multiplication and division), 1 floating-point add/subtract unit, and 1 floating-point multiply/
divide unit. The two floating-point units also handle the specialized SIMD VIS instructions
for accelerating graphics, media, and network functions.


In addition to a 32 KB primary instruction cache, a 64 KB primary data cache, an instruction
fetch engine, a 16-entry instruction buffer, and the 6 parallel execution units, the processor 
also integrates on-chip a 1 MB L2-cache, a 2 KB prefetch cache, a 2 KB write cache, an I/O 
interface (to the JBUS), and a memory controller. FIGURE 3-1 shows a simplified block 
diagram of the UltraSPARC IIIi processor. 


FIGURE 3-1 UltraSPARC IIIi Processor Architecture


3.1.1 Instruction Fetch and Buffering


The instruction issue unit in the UltraSPARC IIIi processor is responsible for fetching, 
queuing, and steering instructions as appropriate to one of the six parallel execution units 
included in the UltraSPARC IIIi processor design. Up to four instructions are fetched and 
decoded at a time. Assuming the fetch request hits in the level-1 instruction cache (and 
certain other conditions are met, e.g., the instruction queue is not full), instruction fetching is 
possible in every clock cycle. If a fetch request misses in the level-1 instruction cache, a fill 
request is sent to the lower memory hierarchy for the 32-byte line containing the missing 
instruction block. 


The instruction cache uses a 32-byte line, containing 8 fixed-width 4-byte SPARC 
instructions. The unified L2 cache uses a 64-byte line. If the instruction request hits in the 
first half of an L2 cache line, the second half of that line is also fetched, and placed in a 
special 32-byte Instruction Prefetch Buffer (IPB), accessed in parallel with the instruction 
cache. This precaution avoids a potential L1 cache miss, in those cases where instruction 
fetching does move on sequentially to use the next group of 8 instructions. 
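
The relationship between the 32-byte instruction cache line, the 64-byte L2 line, and the
Instruction Prefetch Buffer can be expressed as simple address arithmetic. The Python sketch
below is an illustration under stated assumptions: the function name is hypothetical,
addresses are treated as plain integers, and the second-half case returns None because the
text describes prefetching only for requests that hit in the first half of an L2 line.

```python
L1_LINE = 32   # bytes per instruction cache line
L2_LINE = 64   # bytes per L2 cache line

def ipb_prefetch_address(miss_address):
    """Given the address of an instruction fetch that missed in the L1 cache
    and hit in the L2 cache, return the address of the 32-byte block that
    would also be placed in the Instruction Prefetch Buffer, or None if the
    request fell in the second half of the L2 line."""
    l2_base = miss_address & ~(L2_LINE - 1)   # start of the enclosing L2 line
    offset = miss_address - l2_base
    if offset < L1_LINE:
        return l2_base + L1_LINE              # the sequentially next 8 instructions
    return None

print(hex(ipb_prefetch_address(0x101C)))
```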


The UltraSPARC IIIi processor instruction cache contains 1K lines, with a total capacity of
8,192 instructions. Cache lines are virtually indexed but physically tagged. The cache is 4- 
way set-associative. It requires 2 cycles of latency to fetch an item, but access is pipelined, so 
sequential requests have single cycle throughput, after the two cycle delay for the first item is 
satisfied. Other cache features besides the usual data and tag arrays include a microtag, 
predecode bits, a Load Prediction Bit (LPB), and a snoop tag array. The microtag uses 8 bits 
of virtual address to enable fast way-selection of a potentially matching cache line, without 
waiting for the physical address translation to complete. The predecode bits include 
information about which pipeline each instruction will be issued to, and other information to 
optimize execution. The LPB is used to dynamically learn those load instructions that 
frequently see a read-after-write (RAW) hazard with preceding stores. The snoop tag is a 
copy of the tags dedicated for snoops caused by either stores from the same, or different, 
processors. The instruction cache in the UltraSPARC IIIi processor is kept completely
coherent so the cache never needs to be flushed. 


The instruction fetch engine is also dependent upon control transfer instructions such as 
branches and jumps. The UltraSPARC IIIi processor uses a 16K-entry branch predictor to 
predict the fetch direction of conditional branches. For branches that are either known to be 
taken or predicted taken, the branch target must be determined. For PC relative branches, the 
target of the branch is computed. This adds a one-cycle penalty to the branch taken case, but 
avoids any penalties from target misprediction. For predicting the target of return instructions 
an 8-entry Return Address Stack (RAS) is used. For other indirect branches (branches whose 
targets are determined by a register value), the software can provide a branch target 
prediction with a jump target preparation instruction. 
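
The idea behind a return address stack can be shown with a toy model. The Python sketch below
is not the processor's implementation: the class and method names are invented, and the
behavior on overflow (the oldest entry is discarded) and on underflow (no prediction) is an
assumption chosen for illustration. Only the depth of 8 entries comes from the text.

```python
from collections import deque

class ReturnAddressStack:
    """Toy model of an 8-entry return address predictor.
    Overflow and underflow behavior here are assumptions for illustration."""

    def __init__(self, depth=8):
        # A bounded deque silently drops the oldest entry when full.
        self.entries = deque(maxlen=depth)

    def on_call(self, return_address):
        self.entries.append(return_address)

    def predict_return(self):
        """Predicted target of a return, or None if nothing is recorded."""
        return self.entries.pop() if self.entries else None

ras = ReturnAddressStack()
for depth in range(10):              # 10 nested calls, 2 more than the stack holds
    ras.on_call(0x4000 + 8 * depth)

# The 8 most recent calls are predicted in reverse order; the 2 oldest are lost.
predictions = [ras.predict_return() for _ in range(10)]
```

Call nesting deeper than 8 levels therefore loses the outermost return targets in this model,
while typical shallow call chains are predicted exactly.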


The 16-entry instruction buffer decouples the front-end instruction fetch from the back-end
instruction execution, allowing these two parts of the pipeline to operate at different rates. If
more instructions are fetched than can be issued, an empty instruction buffer gradually fills
up. If instruction fetch is interrupted by a taken branch penalty or an instruction cache miss,
a full instruction buffer gradually drains, hiding some or all of the ensuing latency.
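
The fill and drain behavior of the instruction buffer can be illustrated with a cycle-by-cycle
occupancy model. The Python sketch below is a simplification: the function name is
hypothetical, instructions are counted rather than modeled individually, and a fetch group is
assumed to be accepted partially when fewer than 4 entries are free, which the text does not
specify. The 16-entry capacity and 4-wide fetch are taken from the text.

```python
BUFFER_ENTRIES = 16
FETCH_WIDTH = 4

def simulate(fetch_available, issue_per_cycle, start_occupancy=0):
    """Track buffer occupancy cycle by cycle.

    fetch_available: list of booleans, True when the fetch unit can deliver
                     a group of instructions this cycle (False models a miss).
    issue_per_cycle: how many instructions the back end can accept per cycle.
    Returns (occupancy after each cycle, instructions issued each cycle)."""
    occupancy = start_occupancy
    history, issued_log = [], []
    for can_fetch in fetch_available:
        issued = min(occupancy, issue_per_cycle)
        occupancy -= issued
        if can_fetch:
            occupancy += min(FETCH_WIDTH, BUFFER_ENTRIES - occupancy)
        history.append(occupancy)
        issued_log.append(issued)
    return history, issued_log

# Fetch outpaces issue: an empty buffer gradually fills.
filling, _ = simulate([True] * 8, issue_per_cycle=2)

# A 4-cycle fetch gap starting with a full buffer: issue continues uninterrupted.
_, issued = simulate([False] * 4, issue_per_cycle=4, start_occupancy=16)
```
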
3.1.3 Execution Pipelines


The UltraSPARC IIIi processor has six parallel execution units. Buffered instructions can be
issued to all six units in a single cycle, and sustained issue to any 4 of these units is possible.
The six execution units are:


2 integer Arithmetic and Logic Units (ALU) 

1 Branch pipeline 

1 Load/store pipeline (also handles special instructions) 

1 Floating-point multiply pipeline (also handles SIMD instructions) 
1 Floating-point addition pipeline (also handles SIMD instructions) 


The ALUs perform integer addition and subtraction, logic operations, and shifts. These units 
have single-cycle latency and throughput. The branch pipeline handles all branch instructions 
and can resolve one branch each cycle. Load/store operations are discussed in the next 
section. The load/store pipeline also handles integer multiplication and division. Integer
multiplication has a latency of 6 to 9 cycles depending on the size of the operands. Division 
is also iterative and requires 40 to 70 cycles. 


The floating-point units each have 4 cycles of latency, but are fully pipelined (one instruction
per cycle per pipeline). These pipelines handle double and single precision floating-point 
operations and a set of SIMD instructions that operate on 8 or 16-bit fields. Floating-point 
division and square root operations use the floating-point multiplication pipeline and are 
iterative computations. Floating-point division requires 17 cycles for single precision, 20
cycles for double precision computations. Floating-point square root requires 23 cycles for 
single precision, 29 cycles for double precision computations. 
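
The difference between latency and throughput in a fully pipelined unit can be made concrete
with a short calculation. The Python sketch below is an idealized illustration: it ignores
issue restrictions and any other stalls, and the function names and the "fdiv"/"fsqrt" keys
are labels chosen for the example, not instruction mnemonics from this manual. The cycle
counts themselves are those quoted in the text.

```python
FP_PIPELINE_LATENCY = 4   # cycles, from the text above

def pipelined_cycles(n_ops, latency=FP_PIPELINE_LATENCY):
    """Cycles to complete n independent operations issued back to back
    into one fully pipelined unit (one new operation per cycle)."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1)

def dependent_chain_cycles(n_ops, latency=FP_PIPELINE_LATENCY):
    """Cycles when each operation needs the previous result, so no overlap is possible."""
    return n_ops * latency

# Iterative operations, in cycles, as quoted in the text.
ITERATIVE_LATENCY = {
    ("fdiv", "single"): 17, ("fdiv", "double"): 20,
    ("fsqrt", "single"): 23, ("fsqrt", "double"): 29,
}

print(pipelined_cycles(8), dependent_chain_cycles(8))
```

Eight independent additions finish in 11 cycles in this model, against 32 cycles for a chain
of eight dependent additions, which is why exposing independent floating-point work matters.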


Load/Store Unit 


A load or store instruction can be issued each cycle to the load/store pipeline. The load/store 
unit consists of the load/store pipeline, a store queue, a data cache and a write cache. 


Integer loads of unsigned words and double words have a 2-cycle latency. All other loads 
have a 3-cycle latency. There is an 8-entry store queue to buffer stores. Stores reside in the 
store queue from the time they are issued until they complete an update to the write cache. 
The store queue can effectively isolate the processor from the latency of completing stores. If 
the store queue fills up, the processor will block on a subsequent store. 


The store queue allows successive separate stores to the same cache line to collect. For
non-cacheable stores (for example, stores to a graphics frame buffer), this function can greatly
reduce the amount of store traffic generated, effectively raising the bandwidth to external
devices.


3.1.3.1


The UltraSPARC IIIi processor supports store forwarding, the ability to pass data still in the
store queue directly to a quickly following load that attempts to access the same target
location in memory (a Read After Write or RAW hazard). Since 3 cycles of latency are
required for a load to communicate with the store queue, the LPB bit in the instruction cache 
is used to force 2-cycle loads to issue as 3-cycle loads. If a 2-cycle load is not correctly 
predicted to have a RAW hazard, the load must be re-issued. 
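
Store forwarding can be pictured as a lookup in the store queue that takes priority over the
cache. The Python sketch below is a toy model only: the class and method names are invented,
addresses are treated as whole, equally sized locations, and a dictionary stands in for the
caches and memory. The 8-entry capacity and the blocking behavior when the queue is full
follow the text.

```python
STORE_QUEUE_ENTRIES = 8

class StoreQueue:
    """Toy store queue: oldest store at index 0, youngest at the end."""

    def __init__(self):
        self.pending = []                      # list of (address, value)

    def store(self, address, value):
        if len(self.pending) == STORE_QUEUE_ENTRIES:
            raise RuntimeError("store queue full: the processor would block here")
        self.pending.append((address, value))

    def retire_oldest(self, memory):
        """Complete the oldest store by writing it to the backing memory."""
        address, value = self.pending.pop(0)
        memory[address] = value

    def load(self, address, memory):
        """Return (value, forwarded). The youngest matching store wins."""
        for st_address, st_value in reversed(self.pending):
            if st_address == address:
                return st_value, True
        return memory.get(address, 0), False

memory = {0x100: 1}
sq = StoreQueue()
sq.store(0x100, 2)
sq.store(0x100, 3)

early_load = sq.load(0x100, memory)   # both stores still queued: forwarded
sq.retire_oldest(memory)
sq.retire_oldest(memory)
late_load = sq.load(0x100, memory)    # queue empty: value comes from memory
```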


The data cache holds 64 KB. Cache lines are virtually indexed but physically tagged. The 
cache is 4-way set-associative. It requires 2 cycles of latency to fetch an item, but access is 
pipelined, so sequential requests have single-cycle throughput. Like the instruction cache, the 
data cache uses 8-bit microtags to do way-selection based on virtual addresses. The update 
policy is write-through, no write-allocate. The line size is 32 bytes with no subblocking. The 
data cache only needs to be flushed if an alias is created using virtual address bit 13. VA[13] 
is the only virtual bit used to index the data cache. 
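
The statement about VA[13] can be derived from the cache geometry. The Python sketch below is
a worked calculation, assuming power-of-two sizes and the smallest (8 KB) page size; the
function names are invented for the example. It computes which address bits select a set and
which of those bits lie above the page offset, where virtual and physical addresses can
differ.

```python
def index_bit_range(cache_bytes, ways, line_bytes):
    """Return (high, low) address bit positions used to select a set."""
    sets = cache_bytes // (ways * line_bytes)
    low = line_bytes.bit_length() - 1          # log2 of the line size
    high = low + (sets.bit_length() - 1) - 1   # low + log2(sets) - 1
    return high, low

def virtual_index_bits(cache_bytes, ways, line_bytes, page_bytes=8 * 1024):
    """Index bits that lie above the page offset, and so are taken from the
    virtual address rather than being identical in virtual and physical form."""
    high, low = index_bit_range(cache_bytes, ways, line_bytes)
    page_offset_bits = page_bytes.bit_length() - 1
    return [bit for bit in range(low, high + 1) if bit >= page_offset_bits]

KB = 1024
dcache = index_bit_range(64 * KB, 4, 32)   # 64 KB, 4-way, 32-byte lines
icache = index_bit_range(32 * KB, 4, 32)   # 32 KB, 4-way, 32-byte lines

print(dcache, virtual_index_bits(64 * KB, 4, 32))
print(icache, virtual_index_bits(32 * KB, 4, 32))
```

With 512 sets of 32-byte lines, the data cache is indexed by bits 13 through 5, and only bit
13 lies outside the 13-bit offset of an 8 KB page. The same calculation for the 32 KB
instruction cache gives bits 12 through 5, all of which fall within the page offset.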


The write cache is a write-back cache used to reduce the amount of store bandwidth required 
to the L2-cache. It exploits both temporal and spatial locality in the store stream. The small 
(2 KB) structure achieves a store bandwidth equivalent to a 64 KB write-back data cache 
while maintaining TSO compatibility. The write cache is kept fully coherent with both the 
processor pipeline and the system memory state. The write cache is 4-way set-associative 
and has 64-byte lines. The write cache maintains dirty bits on a per byte basis. 
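
Per-byte dirty bits allow several small stores to be merged into one line and written back
together. The Python sketch below models a single line to show the bookkeeping; it is an
illustration under assumptions, not the hardware design. The class name, the use of an
integer as a 64-bit dirty mask, and the sample addresses are all invented for the example.

```python
LINE_BYTES = 64

class WriteCacheLine:
    """One 64-byte line with a dirty bit per byte (toy model)."""

    def __init__(self, base_address):
        self.base = base_address
        self.data = bytearray(LINE_BYTES)
        self.dirty = 0                      # bit i set means byte i was written

    def write(self, address, payload: bytes):
        offset = address - self.base
        if offset < 0 or offset + len(payload) > LINE_BYTES:
            raise ValueError("store does not fall within this line")
        self.data[offset:offset + len(payload)] = payload
        for i in range(offset, offset + len(payload)):
            self.dirty |= 1 << i

    def dirty_bytes(self):
        """Offsets that would need to be written back to the L2 cache."""
        return [i for i in range(LINE_BYTES) if (self.dirty >> i) & 1]

line = WriteCacheLine(0x2000)
line.write(0x2000, b"\x11\x22\x33\x44")     # three separate stores ...
line.write(0x2004, b"\x55\x66\x77\x88")
line.write(0x2002, b"\xAA\xBB")             # ... one of them overlapping

print(line.dirty_bytes())
```

Three stores are absorbed by the line, yet only one write-back of 8 dirty bytes is eventually
needed in this model, which is the bandwidth saving the text describes.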


Data Prefetching Support 


The UltraSPARC IIIi processor makes use of advanced data prefetching mechanisms in both
software and hardware. Software prefetching allows compilers (or Java JITs) to explicitly
expose the memory-level parallelism in programs and to schedule memory operations. There 
are a number of variations of software prefetches. Software prefetches can specify if the data 
should be brought into the processor for reading or for both reading and writing. Software 
can also specify if the data should be installed into the L2-cache, for data that will be reused 
frequently, or only brought into the prefetch cache. 


Hardware prefetching is an automatic facility that looks for common data sequences, and 
attempts to fetch ahead based on detected patterns. 


Prefetch mechanisms are used to both hide load-miss activity and overlap load misses to 
increase memory-level parallelism. Robust prefetch mechanisms that avoid as many load 
misses as possible are especially important for the UltraSPARC IIIi processor since load 
misses block program execution, i.e., on load misses, the processor waits for the load to 
complete before executing any other instructions. 


Specifically to benefit data-intensive floating-point programs, the UltraSPARC IIIi processor 
features a special prefetch cache. The prefetch cache is a small (2 KB) cache that is accessed 
in parallel with the data cache for floating-point loads. In effect, it expands the size of the 
data cache when executing floating-point programs, and can noticeably reduce load misses 
with a correspondingly favorable impact on performance. Floating-point load misses,
hardware prefetches and software prefetches bring data into the prefetch cache. The prefetch
cache is 4-way set-associative and has 64-byte lines which are broken into two 32-byte 
subblocks with separate valid bits. The prefetch cache is write invalidate. 
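The geometry above implies only a handful of sets. The following Python sketch works through the arithmetic. The capacity, associativity, line size, and subblock size come from the text; the locate helper and its choice of index bits are assumptions made for illustration only, since this section does not describe the actual index function.

```python
# Geometry taken from the text: 2 KB capacity, 4 ways, 64-byte lines,
# two 32-byte subblocks per line.
CAPACITY_BYTES = 2 * 1024
WAYS = 4
LINE_BYTES = 64
SUBBLOCK_BYTES = 32

NUM_LINES = CAPACITY_BYTES // LINE_BYTES   # total lines in the cache
NUM_SETS = NUM_LINES // WAYS               # lines per way


def locate(addr):
    """Return (set_index, subblock, byte_in_subblock) for an address.

    ASSUMPTION: the set index is taken from the address bits just above
    the line offset. The manual text does not state the index function.
    """
    line_offset = addr % LINE_BYTES
    set_index = (addr // LINE_BYTES) % NUM_SETS
    return set_index, line_offset // SUBBLOCK_BYTES, line_offset % SUBBLOCK_BYTES


if __name__ == "__main__":
    print("lines:", NUM_LINES, "sets:", NUM_SETS)
    print("0x1234 ->", locate(0x1234))
```

Under these figures the prefetch cache holds 32 lines arranged as 8 sets of 4 ways, and each line carries two valid bits, one per subblock.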


Memory Management Units 


There are separate Memory Management Units (MMUs) for instruction and data address 
translation. MMUs have two primary functions: memory protection, preventing processes 
from accessing each other’s memory spaces, and address translation -- the conversion of 
virtual addresses in the processor’s logical 64-bit address space into real addresses in the 
system’s physical memory. The first time a virtual address is encountered, the processor traps 
to software to walk a set of page tables in memory to locate the corresponding physical 
address. Since the process of translating a virtual address into a physical address is slow, the 
MMUs contain a set of Translation Lookaside Buffers (TLBs). These are specialized caches 
used to store recently mapped pairs of virtual-physical addresses together with associated 
page protection and usage information. Since TLB lookup is fast (unlike the initial 
translation process itself), memory operations can proceed without interruption as long as 
their virtual address “hits” in a TLB. 


The instruction MMU contains two TLBs accessed in parallel. The first TLB is a 16-entry 
fully-associative TLB. This small TLB is perfectly flexible, in the sense that it can hold 
pages of various sizes (8 KB, 64 KB, 512 KB, or 4 MB), and pages can be either locked or 
unlocked. The second TLB is a 128-entry, 2-way set-associative TLB. This large TLB is used 
exclusively to hold unlocked pages of the “default” 8 KB size. 


The data MMU of the UltraSPARC IIIi processor is enhanced to provide more translation 
entries and to provide more support for using large pages for translation. It contains three 
TLBs accessed in parallel. The first TLB is a 16-entry, fully-associative TLB, identical in 
nature to the small TLB in the instruction MMU. The other two TLBs are both 512-entry, 
2-way set-associative caches. Like the large TLB in the instruction MMU, these large data 
TLBs only store entries for unlocked pages. Unlike the large TLB in the instruction MMU, 
the large TLBs in the data MMU can be set to any of the four page sizes, although only 
pages of the same size can be accessed/filled at a time (but multiple pages of that size can be 
handled at once). The two TLBs can be set to either both store pages of the same size, or 
each store pages of different sizes. 


Having the two large TLBs is very important for general use of large pages for translation, in 
systems that need to map large physical memories. One of the TLBs can be set for large 
pages (such as 4 MB pages) while the other can be set to the default page size (usually 8 KB 
pages). With this configuration the processor provides robust support for large pages. 
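To make the benefit of large pages concrete, the sketch below computes the TLB reach, that is, the amount of memory one 512-entry data TLB can map when every entry holds a distinct page of a single size. The entry count and page sizes are taken from the text; the tlb_reach helper is a hypothetical name used only for this illustration.

```python
KB = 1024
MB = 1024 * KB

# Page sizes and entry count as given in the text.
PAGE_SIZES = {
    "8 KB": 8 * KB,
    "64 KB": 64 * KB,
    "512 KB": 512 * KB,
    "4 MB": 4 * MB,
}
LARGE_DTLB_ENTRIES = 512


def tlb_reach(entries, page_bytes):
    """Memory covered when every entry maps a distinct page of one size."""
    return entries * page_bytes


if __name__ == "__main__":
    for label, size in PAGE_SIZES.items():
        reach_mb = tlb_reach(LARGE_DTLB_ENTRIES, size) // MB
        print(f"{label:>6} pages: {reach_mb} MB of reach")
```

With 8 KB pages one large TLB covers 4 MB; with 4 MB pages the same TLB covers 2 GB, which is why dedicating one of the two large TLBs to large pages matters for systems with large physical memories.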


Embedded Cache Unit (Level-2 Unified Cache) 


The UltraSPARC IIIi processor supports an on-chip 1 MB, 4-way set-associative Level 2 
cache. A separate, 4-way set-associative cache is used to store tags for the L2 cache. Tags are 
protected by parity checking; data is fully protected with error correcting code (ECC) that 
allows all single-bit errors to be corrected and double-bit errors to be detected and marked to 
prevent use. 


JBUS Interface Unit 


The UltraSPARC IIIi processor communicates with the JIO chip through JBUS. All 
transactions with the JBUS are routed through the JBUS interface unit. The outgoing control 
logic arbitrates for issuing transactions and for driving data. The incoming control logic 
enqueues all transactions issued on the bus and accumulates snoop results from internal 
caches before driving data on the system bus. The error control logic handles error logging 
and trap generation. 


Memory Controller Unit 


The Memory Control Unit (MCU) handles all data transfers between the system and the main 
memory of the UltraSPARC IIIi processor. The MCU accepts read and write transactions 
from the ECU and JBU. The local memory supports up to 16 GB of DDR 266 MHz 
SDRAM. Data transfers between memory and the JBU are handled by the MCU. The local 
memory consists of two DDR channels, each of which is composed of two 72-bit DIMMs. 
Nine bits of ECC are stored with each 16 bytes of data. The ECC is checked by the MCU 
when data is read from memory. The MCU also handles the memory refresh and Low Power 
operation of memory. 


A major goal of the MCU is to aggressively reduce memory latencies. Methods to reduce 
latency include the following: 

- Allowing reads to bypass writes while preserving the system bus order 
- Starting reads from the ECU speculatively, before they reach the system bus 
- Holding internal SDRAM banks open to reduce the latency due to row access strobe (RAS) 


Processor Operating Modes 


The UltraSPARC IIIi processor operates in various modes. 


Privileged Mode 


This mode is a “supervisor” mode. In this mode, the software is allowed to access both 
privileged and non-privileged registers and address space identifiers (ASIs). There are certain 
features of the processor that can be accessed only in privileged mode. Privileged mode 
execution typically is used by the kernel and operating system. 


Non-Privileged Mode 


This mode is a “non-supervisor” operating mode, in which programs are allowed to access 
only non-privileged registers and ASIs. If non-privileged software tries to access privileged 
registers or ASIs, exceptions are generated and handled by the operating system. 
Non-privileged mode execution is typically used by application programs. 


Reset and RED State 


The UltraSPARC IIIi processor can be reset using various mechanisms. This section deals 
with the reset and RED_state for the UltraSPARC IIIi processor. 


RED_state Characteristics 


A processor enters RED_state in one of the following two ways: 

- By trapping when already at the maximum trap level. 
- By setting PSTATE.RED. 


When the processor enters RED_state, it clears the DCU Control Register, including the 
enable bits for the I-cache, D-cache, I-MMU, D-MMU, and virtual and physical watchpoints. 


Note: Exiting RED_state by writing zero to PSTATE.RED in the delay slot of a JMPL 
is not recommended. A non-cacheable instruction prefetch can be made to the JMPL target, 
which may be in a cacheable memory area. This condition could result in a bus error on 
some systems and cause an instruction_access_error trap. You can mask the trap by setting 
the NCEEN bit in the ESTATE_ERR_EN register to zero, but this approach will mask all 
noncorrectable error checking. Exiting RED_state with DONE or RETRY avoids the 
problem. 


Resets 


Reset priorities from highest to lowest are power-on resets (POR, hard or soft), externally 
initiated reset (XIR), watchdog reset (WDR), and software-initiated reset (SIR). 


Power-on Reset (Hard Reset) 


A Power-on Reset (POR) occurs when the J_POR_L pin is activated and stays asserted until 
the processor is within its specified operating range. When the J_POR_L pin is active, all 
other resets and traps are ignored. POR has a trap type of 1 at physical address offset 0x20. 
Any pending external transactions are canceled. 


After POR, software must initialize certain registers and state whose values are unknown after 
POR. The following must be initialized before the caches are enabled: 

- In the I-cache, valid bits must be cleared and microtag bits must be set so that each way 
within a set has a unique microtag value. 
- In the D-cache, valid bits must be cleared and microtag bits must be set so that each way 
within a set has a unique microtag value. 
- All L2-cache tags and data. 


The I-MMU and D-MMU TLBs must also be initialized. The P-cache valid bits must be 
initialized before any floating-point loads are executed. 











Caution: Executing a DONE or RETRY instruction when TSTATE is uninitialized after a 
POR can damage the chip. The POR boot code should initialize TSTATE<3:0>, using wrpr 
writes, before any DONE or RETRY instructions are executed. 


However, these operations can only be executed in privileged mode. Therefore, user code is 
not at risk of damaging the chip. 


System Reset (Soft Reset) 


A system reset occurs when the J_RST_L pin is activated. When the J_RST_L pin is active, 
all other resets and traps are ignored. System reset has a trap type of 1 at physical address 
offset 0x20. Any pending external transactions are canceled. 


Note — Memory refresh continues uninterrupted during a system reset. The system 
interface, L2-cache configuration, and memory controller configuration are preserved across 
a system reset. 


Externally Initiated Reset (XIR) 


An XIR is sent to the processor through the XIR transaction on the JBUS. It causes a 
SPARC-V9 XIR, which has a trap type of 3 at physical address offset 0x60. XIR has higher 
priority than all other resets except Power-on Reset and System Reset. 


XIR affects only one processor, rather than the entire system. Memory state, cache state, and 
most Control Status Register state are unchanged. System coherency is not guaranteed to be 
maintained through an XIR reset. The saved PC and nPC will only be approximate because 
the trap is not precise with respect to pipeline state. 


Watchdog Reset (WDR) and error_state 


The processor enters error_state when a trap occurs at TL = MAXTL. 


The processor automatically exits error_state using WDR. The processor signals itself 
internally to take a WDR and sets TT = 2. The WDR traps to the address at 
RSTVaddr + 0x40. WDR sets the processor in a state where it is prepared for diagnosis of 
failures. 


WDR affects only one processor, rather than the entire system. CWP updates due to window 
traps that cause watchdog traps are the same as the no watchdog trap case. 


Software-Initiated Reset (SIR) 


An SIR is initiated by an SIR instruction within any processor. This per-processor reset has 
a trap type of 4 at physical address offset 0x80. SIR affects only one processor, rather than the 
entire system. 


RED_state Trap Vector 


When the UltraSPARC IIIi processor processes a reset or trap that enters RED_state, it 
takes a trap at an offset relative to the RED_state trap vector base address (RSTVaddr); 
the base address is at virtual address 0xFFFF FFFF F000 0000, which passes through to 
physical address 0x7FF F000 0000. 
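The reset descriptions above can be summarized as a small lookup table. The Python sketch below collects the trap types, vector offsets, and priority order stated in this section; the XIR trap type of 3 follows the SPARC V9 definition. The function names reset_vector and highest_priority are hypothetical and exist only for this illustration.

```python
RSTV_BASE_VA = 0xFFFF_FFFF_F000_0000  # RSTVaddr (virtual address)

# (name, trap type, vector offset), listed from highest to lowest priority.
RESETS = [
    ("POR", 1, 0x20),  # power-on and system (soft) reset both use this entry
    ("XIR", 3, 0x60),
    ("WDR", 2, 0x40),
    ("SIR", 4, 0x80),
]


def reset_vector(name):
    """Virtual address of the RED_state trap vector entry for a reset."""
    for reset_name, _trap_type, offset in RESETS:
        if reset_name == name:
            return RSTV_BASE_VA + offset
    raise KeyError(name)


def highest_priority(pending):
    """Pick the reset that wins when several are pending (hypothetical helper)."""
    for reset_name, _trap_type, _offset in RESETS:
        if reset_name in pending:
            return reset_name
    return None


if __name__ == "__main__":
    for reset_name, trap_type, _offset in RESETS:
        print(f"{reset_name}: TT={trap_type} vector={reset_vector(reset_name):#x}")
```

In these values each offset equals the trap type multiplied by 0x20, so the four reset entries sit in consecutive 32-byte slots above the base address.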





Error Handling 


The UltraSPARC IIIi processor provides extensive support for detecting and correcting 
errors. Note that some errors may still be uncorrectable. 


Error Classes in Severity 
The classes of error in order of severity are as follows: 


1. Hardware-corrected errors. Hardware tries to correct the error automatically. When the 
error is corrected, a trap is generated so that software can log the error conditions for 
preventive maintenance. 


2. Software-correctable errors. Hardware does not correct the error automatically. Instead, 
it invokes a trap requesting the recovery software to correct the error. Corrective actions 
are expected from the recovery software. If recovery is successful, the system should 
continue the operation. 


3. Uncorrectable errors. By its nature the error is uncorrectable, and hardware invokes a 
trap to signal the occurrence of the error to appropriate recovery software. Depending on 
the condition under which the error occurs, the system may be able to recover from the 
error and continue operation. If not, it may be able to isolate the error to a particular 
process and terminate it. Otherwise, the software should shut down the system gracefully. 


4. Fatal errors. By its nature, the error indicates either loss of system consistency or a 
system interconnect protocol error. It is dangerous to continue operation in this situation 
because of the impending threat of a failure to maintain data integrity. Therefore, upon the 
detection of the error, the processor generates an error signaling sequence to its 
interconnect, expecting to be halted/reset by the system. System actions induced by the 
error signaling sequence are dependent on system implementation. 
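As a rough illustration of how recovery software might branch on these four classes, the Python sketch below encodes the severity order and the decision sequence described in the list. The enum, the function name, the two boolean inputs, and the returned labels are all hypothetical; the manual defines the classes, not this interface.

```python
from enum import IntEnum


class ErrorClass(IntEnum):
    """Error classes from the list above; a larger value is more severe."""
    HARDWARE_CORRECTED = 1
    SOFTWARE_CORRECTABLE = 2
    UNCORRECTABLE = 3
    FATAL = 4


def software_response(error_class, recoverable=False, isolated_to_process=False):
    """Return a short label for the action recovery software would take.

    The labels and the two boolean inputs are hypothetical; they only
    mirror the decision order described in the text.
    """
    if error_class == ErrorClass.HARDWARE_CORRECTED:
        return "log for preventive maintenance"
    if error_class == ErrorClass.SOFTWARE_CORRECTABLE:
        return "correct in software and continue"
    if error_class == ErrorClass.UNCORRECTABLE:
        if recoverable:
            return "recover and continue"
        if isolated_to_process:
            return "terminate affected process"
        return "shut down gracefully"
    # Fatal: the processor signals its interconnect and expects halt/reset.
    return "no software action; system halt or reset expected"


if __name__ == "__main__":
    print(software_response(ErrorClass.UNCORRECTABLE, isolated_to_process=True))
```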


Corrective Actions 


Errors are handled by invocation of one of the following actions: 


- Reset-inducing error sequence. Any fatal error causes the error signaling sequence to 
induce a system reset. Some errors asynchronous to instruction execution may generate 
this error signaling sequence. 


- Precise traps. Most errors detected in the course of an instruction execution generate a 
precise trap. If the error is hardware correctable, software just logs it. If the error is 
software correctable, software corrects it before continuing execution. If the error is 
uncorrectable, software takes appropriate action. 


- Deferred traps. Some uncorrectable errors requiring immediate attention generate a 
deferred trap to request software intervention. The recovery software examines the 
recorded error information to determine the extent of the damage caused by the error. 
Depending on the observed effect, the system may need to be brought down, or it may 
continue to run when the effect is isolated within the user program. In any event, the error 
does not require immediate reset of the system. 


- Disrupting traps. An error asynchronous to instruction execution generates a disrupting 
trap to request logging and clearing. The error may already be corrected by hardware and 
may only require logging. If the error is software correctable, software corrects it before 
continuing execution. If the error is uncorrectable, software takes appropriate action. 


Errors Synchronous and Asynchronous to Instruction Execution 


Some errors can be detected asynchronously to instruction execution. Other errors are 
detected in the course of an instruction execution, that is, synchronous to instruction 
execution. Separate error recording mechanisms are used for synchronous and asynchronous 
errors. 


An error asynchronous to instruction execution is signalled by either a disrupting or deferred 
trap to the processor, or through an error signaling sequence to system hardware which 
induces a system reset depending on the severity of the error. The errors signalled through a 
disrupting trap do not directly correspond to an instruction. Traps may or may not be 
recoverable. Errors signalled are meant to indicate either a loss of system consistency or a 
protocol error on system interconnect. 


An error detected in the course of an instruction execution is signalled through an error trap 
to the instruction, with additional information recorded in hardware. The trap is either 
precise or deferred. The program (process) affected by the error should be given a corrected 
response, or if the error is uncorrectable, the process should be terminated appropriately. 
Precise traps are used wherever possible. 


Debug and Diagnostics Mode 


The UltraSPARC IIIi processor provides interfaces for diagnostic access to most internal state 
of the processor. This is important for diagnosing and, when possible, recovering from failures. 
There are several different diagnostic interfaces. All the diagnostic interfaces are accessible 
only from software running in privileged mode or from an external system controller in a 
server. All internal diagnostic and configuration registers are 8 bytes wide, and must be 
accessed as 8-byte units with 8-byte aligned addresses. 
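The access rule in the paragraph above amounts to a simple check that diagnostic software could apply before issuing an access. The Python sketch below expresses it; the function name is hypothetical, and the check covers only the size and alignment requirement, not privilege or ASI selection.

```python
DIAG_REGISTER_BYTES = 8  # register width stated in the text


def is_valid_diag_access(address, size_bytes):
    """True if the access is a full 8-byte unit at an 8-byte aligned address."""
    return (
        size_bytes == DIAG_REGISTER_BYTES
        and address % DIAG_REGISTER_BYTES == 0
    )


if __name__ == "__main__":
    for addr, size in [(0x40, 8), (0x44, 8), (0x40, 4)]:
        print(hex(addr), size, is_valid_diag_access(addr, size))
```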


There are a number of diagnostic registers that are mapped to internal ASI registers. These 
registers are accessed by load and store alternate ASI instructions that specify certain 
configurations of ASI numbers and virtual addresses. Diagnostic registers are provided for 
recording various fault conditions as well as important information and state associated with 
the fault, to help diagnose it and possibly recover from it. 


For diagnostics and error recovery in the large on-chip memories, such as caches, each element 
of these memory arrays can be individually read and written. Accesses are performed with load 
and store alternate instructions that use specific ASIs pointing to the memory array. These accesses 
can only be done by privileged software. 


Special ASI numbers are used for diagnostic accesses to structures where the virtual address is 
used to specify the portion of the structure to be read. Most structures can be directly read and 
many structures can also be directly written or quickly cleared. 


The UltraSPARC IIIi processor also provides a serial JTAG interface that can be used by a 
system controller for diagnostics. A system controller can perform a shadow scan where 
various configuration and diagnostic information is scanned out of the processor without 
interfering with the operation of the processor. The system controller can also use the JTAG 
interface to scan in information to configure or control various aspects of the processor. 


The JTAG interface also can be used to perform a full scan dump. When a full scan dump is 
performed, most of the flops in the processor are scanned out through a scan chain. A full scan 
dump is a destructive action and the processor must be reset after completion of the dump. The 
full scan provides an important tool for diagnosis of serious failures. 


For controlling diagnostics mode, there is a range of configuration registers, which can enable 
and disable many features of the processor. The configuration registers are only accessible in 
privileged mode. Some of the configuration registers are implemented as ASRs. These registers 
are accessible from the RDASR/WRASR interface. Most of the configuration registers are 
mapped as internal ASI registers. These registers are accessed by load and store alternate ASI 
instructions that specify certain configurations of ASI numbers and virtual addresses. 


CHAPTER 4 


Instruction Execution 


This chapter focuses on the needs of compiler writers and others who are interested in 
scheduling instructions to optimize program performance. The chapter discusses the 
following topics: 

- Section 4.1, “Introduction” 
- Section 4.2, “Processor Pipeline” 
- Section 4.3, “Pipeline Recirculation” 
- Section 4.4, “Grouping Rules” 
- Section 4.5, “Conditional Moves” 
- Section 4.6, “Instruction Latencies and Dispatching Properties” 


4.1 Introduction 


The instruction at the memory location specified by the program counter (PC) is fetched and 
then executed, annulled, or trapped. Instruction execution may change program-visible 
processor and/or memory state. As a side effect of its execution, new values are assigned to 
the PC and the next program counter (nPC). 


An instruction may generate an exception if it encounters some condition that makes it 
impossible to complete normal execution. Such an exception may in turn generate a precise 
trap. Other events may also cause traps: an exception caused by a previous instruction (a 
deferred trap), an interrupt or asynchronous error (a disrupting trap), or a reset request (a 
reset trap). If a trap occurs, control is vectored into a trap table. 


4.1.1 NOP, Neutralized, and Helper Instructions 


The distinction between NOP and neutralized instructions is subtle. 


4.1.1.1 NOP Instruction 





The architected NOP instruction is coded as a SETHI instruction with the destination register 
%g0. This instruction is groupable in the A0 or A1 pipeline. 
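The encoding can be checked with a few lines of Python. The sketch below builds a SETHI instruction word using the standard SPARC V9 format 2 field layout (op in bits 31:30, rd in bits 29:25, op2 in bits 24:22, imm22 in bits 21:0); that layout comes from the SPARC V9 architecture rather than from this section, and the encode_sethi helper is a name chosen for illustration.

```python
OP_FORMAT2 = 0b00    # op field for branches and SETHI
OP2_SETHI = 0b100    # op2 field selecting SETHI
G0 = 0               # register number of %g0


def encode_sethi(imm22, rd):
    """Assemble a 32-bit SETHI instruction word."""
    if not 0 <= rd < 32:
        raise ValueError("rd must be a 5-bit register number")
    if not 0 <= imm22 < (1 << 22):
        raise ValueError("imm22 must fit in 22 bits")
    return (OP_FORMAT2 << 30) | (rd << 25) | (OP2_SETHI << 22) | imm22


# The architected NOP is "sethi 0, %g0".
NOP = encode_sethi(0, G0)

if __name__ == "__main__":
    print(f"{NOP:#010x}")
```

Because the destination is %g0 and the immediate is zero, only the op2 field is nonzero, which gives the instruction word 0x01000000.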


4.1.1.2 Neutralized Instruction 


Some instructions have no visible effects on the software. They have been de-implemented or 
assigned to not have an effect if the processor is in a certain mode. These instructions are 
often referred to as NOP instructions, but they are not the same as the NOP instruction in that 
they execute in the pipeline that is assigned to them. These are versions of instructions that 
have no effect because they only access the %g0 register and do not have any side effects. 
Hence, these instructions are functionally neutral. 


4.1.1.3 Helper Instructions 


Helper instructions are generated by the hardware to help in the execution or re-execution of 
an instruction. The hardware partitions a single instruction into multiple instructions that 
flow through the pipeline, consecutively. They have no software visibility and are part of the 
hardware function of the pipeline. 


4.2 Processor Pipeline 


The processor pipeline consists of fourteen stages plus an extra stage that is occasionally 
used by the hardware. The pipeline stages are referred to by the following mnemonic 
single-letter names and are shown in TABLE 4-1. 


TABLE 4-1 Processor Pipeline Stages 

Pipeline Stage   Definition 
A                Address generation 
P                Preliminary Fetch 
F                Fetch instructions from I-cache 
B                Branch target computation 
I                Instruction group formation 
J                Instruction group staging 
R                Register access (dispatch/dependency checking stage) 
E                Execute 
C                Cache 
M                Miss detect 
W                Write 
X                eXtend 
T                Trap 
D                Done 


Rather than executing the instructions in a single pipeline, several separate pipelines are each 
dedicated to execution of a particular class of instructions. The execution pipelines start after 
the R-stage of the pipeline. Some instructions take a cycle or two to execute; others take a 
few cycles within the pipeline. As long as the execution fits within the fixed pipeline depth, 
execution can in general be fully pipelined. Some instructions have extended execution times 
that sometimes vary in duration depending on the state of the processor. 


The following sections provide a stage-by-stage description of the pipeline. Chapter 3, 
“UltraSPARC IIIi Processor Architecture Basics,” describes the functions of the various 
execution units. This chapter explains how the pipeline operates the execution units to 
process the instructions. 


FIGURE 4-1 on page 34 illustrates each pipeline stage in detail and the relationship between 
high-level, large architectural structures. 


FIGURE 4-1 Instruction Pipeline Diagram 


34 UltraSPARC IIIi Processor User's Manual * June 2003 


4.2.1 Instruction Dependencies 


Instruction dependencies exist in the grouping, dispatching, and execution of instructions. 


4.2.1.1 Grouping Dependencies 


Up to four instructions can be grouped together for simultaneous dispatch. The number of 
instructions that can be grouped together depends on the consecutive instructions that are 
present in the instruction fetch stream, the availability of execution resources (execution 
units), and the state of the system. Instructions are grouped together to provide superscalar 
execution of multiple instruction dispatches per clock cycle. 


Some instructions are single instruction group instructions. These are dispatched by 
themselves one clock at a time as a single instruction in the group. 





Note — Pipeline Recirculation: During recirculation, the recirculation invoking instruction 
is often re-executed as a single group instruction and often with a helper instruction inserted 
into the pipeline by the hardware. Even groupable instructions are retried in a single 
instruction group. See Section 4.3 “Pipeline Recirculation” on page 41 for details. 





4.2.1.2 Dispatch Dependencies 


Instructions can be held at the R-stage for many different reasons, including: 

- Working register operand is not available 
- Functional unit is not available 
- Store-load sequence is in progress (atomic operation) 


When instructions are held at the dispatch stage, the upper pipeline continues to operate until 
the instruction buffer is full. At that point, the upper pipeline stalls. 


During recirculation, the recirculation invoking instruction is held at the dispatch stage until 
its execution dependency is resolved. 


4.2.1.3 Execution Dependencies 


The pipeline assumes all load instructions will hit in a primary cache, allowing the pipeline 
to operate at full speed. There are two occurrences that will recirculate the pipeline: 

- D-cache miss 
- Load requires data to be bypassed from an earlier store that has not completed and does 
not meet the criteria for read-after-write data bypassing 


Instruction-Fetch Stages 


The instruction-fetch pipeline stages A, P, F, and B are described below. 


A-stage (Address Generation) 


The address stage generates and selects the fetch address to be used by the Instruction Cache 
(I-cache) in the next cycle. The address that can be selected in this stage for instruction 
fetching comes from several sources including: 


- Sequential PC 
- Branch target (from B-stage) 
- Trap target 
- Interrupt 
- Predicted return target 
- JMPL target 
- Resolved branch/JMPL target from execution pipeline 


P-stage (Preliminary Fetch) 


The preliminary fetch stage starts fetching four instructions from the I-cache. Since the 
I-cache has a two-cycle latency, the P-stage and the F-stage are both used to complete an 
I-cache access. Although the I-cache has a two-cycle latency, it is pipelined and can access a 
new set of up to four instructions every cycle. The address used to start an I-cache access is 
generated in the previous cycle. 


The P-stage also accesses the Branch Predictor (BP), which is a small, single-cycle access 
SRAM whose output is latched at the end of the P-stage. The BP predicts the direction of all 
conditional branches, based on the PC of the branch and the direction history of the most 
recent conditional branches. 


F-stage (Fetch) 


The F-stage is used for the second half of the I-cache access. At the end of this stage, up to 
four instructions from an I-cache line (32 bytes) are latched for decode. An I-cache fetch 
group is not permitted to cross an I-cache line boundary (32 bytes). 
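The line-boundary rule determines how many instructions a single fetch can deliver. The Python sketch below computes that count from the fetch address, using the 32-byte line size and the four-instruction limit from the text together with the fixed 4-byte SPARC instruction size. The function name is hypothetical, and the sketch ignores I-cache misses and branch effects.

```python
INSTRUCTION_BYTES = 4   # SPARC instructions are a fixed 32 bits
ICACHE_LINE_BYTES = 32  # I-cache line size from the text
MAX_FETCH = 4           # at most four instructions per fetch


def fetch_group_size(pc):
    """Instructions one fetch can deliver without crossing a line boundary."""
    if pc % INSTRUCTION_BYTES != 0:
        raise ValueError("PC must be 4-byte aligned")
    bytes_left_in_line = ICACHE_LINE_BYTES - (pc % ICACHE_LINE_BYTES)
    return min(MAX_FETCH, bytes_left_in_line // INSTRUCTION_BYTES)


if __name__ == "__main__":
    for pc in (0x1000, 0x1010, 0x1014, 0x101C):
        print(hex(pc), fetch_group_size(pc))
```

A fetch that starts in the first half of a line delivers a full group of four, while a fetch that starts at the last word of a line delivers a single instruction, so branch targets placed near the end of a line reduce fetch bandwidth.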


B-stage (Branch Target Computation) 


The B-stage is the final stage of the instruction-fetch pipeline, A-P-F-B. In this stage, the 
four fetched instructions are first available in a register. The processor analyzes the 
instructions, looking for Delayed Control Transfer Instructions (DCTI) that can alter the path 
of execution. It finds the first DCTI, if any, among the four instructions and computes (if PC 
relative) or predicts (if register based) its target address. If this DCTI is predicted taken, the 
target address is passed to the A-stage to begin fetching from that stream; if predicted not 
taken, the target is passed on to the CTI queue for use in case of mispredict. Also in the 
B-stage, the computation of the hit or miss status of the instruction fetch is performed, so 
that the validity of the four instructions can be reported to the instruction queue. 





In the case of an I-cache miss, a request is issued to the L2-cache and, if needed, all the way 
out to memory to get the required line. The processor includes an optimization where, 
along with the line being fetched, the subsequent line (32 bytes) is also returned and placed 
into the instruction prefetch buffer. A subsequent miss that can get its instructions from the 
instruction prefetch buffer will behave like a fast miss. 


Instruction Issue and Queue Stages 


The I-stage and J-stage correspond to the enqueueing and dequeuing of instructions from the 
instruction queue. The R-stage is where instruction dependencies are resolved. 


I-stage (Instruction Group Formation) 


In the I-stage, the instructions fetched from the I-cache are entered as a group into the 
instruction queue. The instruction queue is four instructions wide by four instruction groups 
deep. The instruction may wait in the queue for an arbitrary period of time until all earlier 
instructions are removed from the queue. 


The instructions are grouped to use up to four of the execution pipelines, shown in TABLE 4-2. 


TABLE 4-2 Execution Pipelines 

Pipeline   Description 
A0         Integer ALU pipeline 0 
A1         Integer ALU pipeline 1 
BR         Branch pipeline 
MS         Memory/Special pipeline 
FGM        Floating-point/VIS multiply pipeline (with divide/square root pathway) 
FGA        Floating-point/VIS add ALU pipeline 
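To illustrate how pipeline availability limits group size, the Python sketch below forms groups greedily: instructions are taken in program order, each is steered to a free pipeline of the right kind, and a new group starts when no suitable pipeline is free or the group already holds four instructions. This is a deliberately simplified model under stated assumptions. It ignores data dependencies, single-instruction groups, and the detailed grouping rules of Section 4.4, and the instruction class labels and function name are hypothetical.

```python
MAX_GROUP = 4
PIPELINES = ("A0", "A1", "BR", "MS", "FGM", "FGA")

# Hypothetical instruction classes and the pipelines each may use.
CANDIDATES = {
    "ALU": ("A0", "A1"),
    "BR": ("BR",),
    "MS": ("MS",),
    "FGM": ("FGM",),
    "FGA": ("FGA",),
}


def form_groups(classes):
    """Greedily pack instruction classes, in order, into dispatch groups."""
    groups, current, free = [], [], set(PIPELINES)
    for cls in classes:
        slot = next((p for p in CANDIDATES[cls] if p in free), None)
        if slot is None or len(current) == MAX_GROUP:
            # Close the current group and start a new one with all pipelines free.
            groups.append(current)
            current, free = [], set(PIPELINES)
            slot = CANDIDATES[cls][0]
        free.remove(slot)
        current.append(slot)
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    print(form_groups(["ALU", "ALU", "ALU", "MS", "BR"]))
```

In the example, the third integer ALU instruction cannot join the first group because both A0 and A1 are taken, so it opens a second group together with the memory and branch instructions.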


J-stage (Instruction Group Staging) 


In the J-stage, a group of instructions is dequeued from the instruction queue and prepared 
to be sent to the R-stage. If the R-stage is expected to be empty at the end of the current 
cycle, the group is sent to the R-stage. 


R-stage (Dispatch and Register Access) 


The integer working register file is accessed during the R-stage for the operands of the 
instructions (up to three) that have been steered to the A0, A1, and MS pipelines. At the end 
of the R-stage, results from previous instructions are bypassed in place of the register file 
operands, if required. 


Up to two floating-point or VIS instructions are sent to the Floating-Point/VIS Unit in this 
stage. 


The register and pipeline dependencies between the instructions in the group and the 
instructions in the execution pipelines are calculated concurrently with the register file 
access. If a dependency is found, the dependent instruction and any older instruction in the 
group are held in the R-stage until the dependency is resolved. 


S-stage (Normally Bypassed) 


The S-stage provides a 1-entry buffer per pipeline in cases when the R-stage is not able to 
take a new instruction. 


Execution Pipeline 


The execution pipeline contains the E, C, M, W, and X stages. 


Integer Instruction Execution: E-stage (Execute) 


The E-stage is the first stage of the execution pipelines. Different actions are performed in 
each pipeline. 


Integer instructions in the A0 and A1 pipelines compute their results in the E-stage. The 
instructions include most arithmetic, all shift, and all logical instructions. Their results are 
available for bypassing to dependent instructions that are in the R-stage, resulting in 
single-cycle execution for most integer instructions. The A0 and A1 pipelines are the only 
two sources of bypass results in the E-stage. 


Other integer instructions are steered to the MS pipeline and, if necessary, are sent with their 
operands to the special execution unit in this stage. They can start their execution during the 
E-stage, but will not produce any results to be bypassed until the C-stage or the M-stage. 


Load instructions steered to the MS pipeline start accessing the D-cache or P-cache during 
the E-stage. The D-cache features Sum Addressed Memory (SAM) decode logic that 
combines the arithmetic calculation for the virtual address with the row decode of the 
memory array to reduce look-up time. The virtual address is computed in the E-stage for 
translation lookaside buffer (TLB) access and possible access to the P-cache. 


Floating-point and VIS instructions access the floating-point register file in the E-stage to 
obtain their operands. At the end of the E-stage, the results from previous completing 
floating-point/VIS instructions can be bypassed to the E-stage instructions. 


Conditional branch instructions in the BR pipeline resolve their directions in the E-stage. 
Based on their original predicted direction, a mispredict signal is computed and sent to the 
A-stage for possible refetching of the correct instruction stream. 





JMPL and RETURN instructions compute their target addresses in the E-stage of the MS 
pipeline. The results are sent to the A-stage to start fetching instructions from the target 
stream. 


C-stage (Cache) 


The D-cache delivers results for doubleword (64-bit) and unsigned word (32-bit) integer 
loads in the C-stage. The D-TLB access is initiated in the C-stage and proceeds in parallel 
with the D-cache access. For floating-point loads, the P-cache access is initiated in the 
C-stage. The results of the D-TLB access and P-cache access are available in the M-stage. 


Special instruction unit results are produced at the end of this stage and can be bypassed to 
waiting dependent instructions in the R-stage, giving a minimum two-cycle latency for SIU 
instructions. The integer pipelines, A0 and A1, write their results back to the working 
register file in the C-stage. 


The C-stage is the first stage of execution for floating-point and VIS instructions in the FGA 
and FGM pipelines. 


M-stage (Miss) 


D-cache misses are determined in the M-stage by a comparison of the physical address from 
the D-TLB to the physical address in the D-cache tags. If the load requires additional 
alignment or sign extension (such as signed word, all halfword, and all byte loads), it is 
carried out in this stage, resulting in a three-cycle latency for those load operations. This 
stage is used for the second execution cycle of floating-point and VIS instructions. Load data 
is available to the floating-point pipelines in the M-stage. 


W-stage (Write) 


In the W-stage, the MS integer pipeline results are written into the working register file. The 
W-stage is also used as the third execution cycle for floating-point and VIS instructions. The 
results of the D-cache miss are available in this stage and the requests are sent to the 
L2-cache if needed. 


X-stage (Extend) 


The X-stage is the last execution stage for most floating-point operations (except divide and 
square root) and for all VIS instructions. Floating-point results from this stage are available 
for bypass to dependent instructions that will be entering the C-stage in the next cycle. 


Trap and Done Stages 


This section describes the stages that interrupt or complete instruction execution. 


The results of operations are bypassed and sent to the working register file. If no traps are 
generated, then they are successfully pipelined down to the architectural register file and 
committed. If a trap or recirculation occurs, then the architectural register file (which contains 
the committed data) is copied to the working register file in preparation for the instructions to 
be re-executed. 


T-stage (Trap) 


Traps, including floating-point and integer traps, are signalled in this stage. The trapping 
instruction and all instructions younger than it must invalidate their results before 
reaching the D-stage to prevent those results from being erroneously written into the 
architectural or floating-point register files. 


D-stage (Done) 


Integer results are written into the architectural register file in this stage. At this point, they 
are fully committed and are visible to any traps generated from younger instructions in the 
pipeline. 


Floating-point results are written into the floating-point register file in this stage. These 
results are visible to any traps generated from younger instructions. 


4.3 Pipeline Recirculation 


When a dependency is encountered in or before the dispatch R-stage, the pipeline is 
stalled. Most dependencies, such as register or FP dependencies, are resolved in the R-stage. 
When a dependency is encountered after the dispatch R-stage, the pipeline is 
recirculated. Recirculation involves resetting the PC back to the instruction that invoked the 
recirculation. Instructions older than the dependent instruction continue to execute. The 
offending instruction and all younger instructions are recirculated. The offending instruction 
is retried and goes through the entire pipeline again. 


Upon recirculation, the instruction responsible for the recirculation becomes a single-group 
instruction that is held in the R-stage until the dependency is resolved. 
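

The stall-versus-recirculate distinction described above can be sketched as a small decision 
function. This is an illustrative model only: the stage list is limited to the stages named in 
this chapter from dispatch onward, and the function name `dependency_action` and its string 
results are assumptions made for this example, not part of any processor interface. 

```python
# Illustrative model only. Stage names follow this chapter; the function and
# its return values are hypothetical.
STAGES_FROM_DISPATCH = ["R", "E", "C", "M", "W", "X", "T", "D"]


def dependency_action(stage_detected: str) -> str:
    """Return how the pipeline reacts to a dependency found in a given stage.

    Any stage not listed in STAGES_FROM_DISPATCH is treated as lying before
    the R-stage.
    """
    if stage_detected not in STAGES_FROM_DISPATCH or stage_detected == "R":
        # Found in or before the dispatch stage: the pipeline is stalled.
        return "stall"
    # Found after dispatch: the PC is reset to the offending instruction,
    # which is replayed together with all younger instructions.
    return "recirculate"


print(dependency_action("R"))  # stall
print(dependency_action("M"))  # recirculate
```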


Load Instruction Dependency 


In the case of a load instruction miss in a primary cache, the pipeline recirculates and the 
load instruction waits in the R-stage. When the data is returned in the D-cache fill buffer, the 
load instruction is dispatched again and the data is provided to the load instruction from the 
fill buffer. The pipeline logic inserts two helpers behind the load instruction to move the data 
in the fill buffer to the D-cache. The instruction in the instruction fetch stream, after the load 
instruction, follows the helpers and will re-group with younger instructions, if possible. 





4.4 Grouping Rules 


Grouping decisions are made before instructions enter the R-stage. A group is a collection of 
instructions with no resource constraints that will limit them from being executed in parallel. 


Instruction grouping rules are necessary for the following reasons: 
- Maintain the instruction execution order 
- Each pipeline runs a subset of instructions 
- Resource dependencies, data dependencies, and multicycle instructions require helpers 
  (NOPs) to maintain the pipelines 


Before continuing, the following terms that apply to instructions are defined: 


break-before: The instruction will always be the first instruction of a group. 


break-after: The instruction will always be the last instruction of a group. 


single-instruction group (SIG): The instruction will not be issued with any other 
instructions in the group. (SIG is sometimes shortened herein to “single-group.”) 


instruction latency: The number of processor cycles after dispatching an instruction from 
the R-stage that a following data-dependent instruction can dispatch from the R-stage. 


blocking, multicycle: The instruction reserves one or more of the execution pipelines for 
more than one cycle. The reserved pipelines are not available for other instructions to issue 
into until the blocking, multicycle instruction completes. 
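

As a rough illustration of how the break-before, break-after, and single-group properties 
partition an instruction stream, the following sketch forms groups from per-instruction 
flags. It deliberately ignores pipeline resources, data dependencies, and helpers. The class 
`Instr`, the function `form_groups`, and the `max_width` value of 4 are assumptions made for 
this example only. 

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    break_before: bool = False   # must be the first instruction of its group
    break_after: bool = False    # must be the last instruction of its group
    single_group: bool = False   # must be alone in its group


def form_groups(stream, max_width=4):
    """Split an in-order instruction stream into groups.

    max_width is an assumed upper bound on group size for this sketch.
    """
    groups, current = [], []
    for ins in stream:
        starts_new = ins.break_before or ins.single_group
        if current and (starts_new or len(current) == max_width):
            groups.append(current)      # close the group before this instruction
            current = []
        current.append(ins)
        if ins.break_after or ins.single_group:
            groups.append(current)      # close the group after this instruction
            current = []
    if current:
        groups.append(current)
    return groups


stream = [
    Instr("ADD"),
    Instr("ALIGNADDR", break_after=True),
    Instr("SUB"),
    Instr("SAVE", single_group=True),
    Instr("OR"),
]
print([[i.name for i in g] for g in form_groups(stream)])
# [['ADD', 'ALIGNADDR'], ['SUB'], ['SAVE'], ['OR']]
```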


Execution Order 


Rule: Within the R-stage, some of the instructions can be dispatched and others cannot. 
If an instruction is younger than an instruction that is not able to dispatch, then the 
younger instruction will not be dispatched. 


“Younger” and “older” refer to instruction order within the program. The instruction that 
comes first in the program order is the older instruction. 
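

The execution-order rule amounts to dispatching only the oldest run of ready instructions. 
A minimal sketch follows, assuming the group is supplied oldest first; the function name 
`dispatchable_prefix` and the `can_dispatch` callback are hypothetical. 

```python
def dispatchable_prefix(group, can_dispatch):
    """Return the instructions that dispatch this cycle, oldest first.

    group is in program order (oldest first); can_dispatch is a callable
    that reports whether one instruction is able to dispatch.
    """
    dispatched = []
    for ins in group:
        if not can_dispatch(ins):
            break                # every younger instruction must also wait
        dispatched.append(ins)
    return dispatched


ready = {"add": True, "ldx": False, "sub": True}
print(dispatchable_prefix(["add", "ldx", "sub"], ready.get))  # ['add']
```

Here `sub` is ready but is younger than the blocked `ldx`, so it is held back as well. 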


Integer Register Dependencies to Instructions in the MS Pipeline 


Rule: If a source register operand of an instruction in the R-stage matches the 
destination register of an instruction in the MS pipeline's E-stage, then the instruction 
in the R-stage may not proceed. 


The MS pipeline has no E-stage bypass. 


If an operand of an instruction in the R-stage matches the destination register of an 
instruction in the MS pipeline's C-stage, then the instruction in the R-stage may not proceed 
if the instruction in the MS pipeline's C-stage does not generate its data until the M-stage. 
For example, LDSB does not have the load data until the M-stage, but LDX has its data in the 
C-stage. Thus, LDX would not cause an interlock, but LDSB would. 
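

The E-stage and C-stage checks above can be sketched as a predicate. This is a simplified 
model that covers only the two loads used in the example; the table `DATA_READY_STAGE` and 
the function `r_stage_must_wait` are names invented for this sketch. 

```python
# Stage in which each MS-pipeline load has its data, per the example in the text.
DATA_READY_STAGE = {"LDX": "C", "LDSB": "M"}


def r_stage_must_wait(src_regs, producer, producer_stage, producer_dest):
    """Return True if an R-stage instruction is held by an MS-pipeline producer.

    src_regs       -- source registers of the instruction in the R-stage
    producer       -- mnemonic of the instruction in the MS pipeline
    producer_stage -- stage the producer currently occupies
    producer_dest  -- destination register of the producer
    """
    if producer_dest not in src_regs:
        return False                       # no register match, no interlock
    if producer_stage == "E":
        return True                        # the MS pipeline has no E-stage bypass
    if producer_stage == "C":
        # Held only if the producer does not have its data until the M-stage.
        return DATA_READY_STAGE[producer] == "M"
    return False                           # M-stage destination is not checked


print(r_stage_must_wait({"%o1"}, "LDSB", "C", "%o1"))  # True
print(r_stage_must_wait({"%o1"}, "LDX", "C", "%o1"))   # False
```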


Most instructions in the MS pipeline have their data by the M-stage, so there is no 
dependency check on the MS pipeline's M-stage destination register. In the case of 
multicycle MS instructions, the data is always available by the M-stage as the last of the 
instructions passes through the pipeline. 


Helpers 


Sometimes an instruction, as part of its operation, requires multiple flows in the pipeline. 
These extra flows after the initial instruction flow are called helper cycles. The only pipeline 
that executes such instructions is the MS pipeline. If an instruction requires a helper, that 
helper is generated in the R-stage. The helper generation logic generates as many helpers as 
the instruction requires. 


Most of the time the logic determines the number of helpers by examining the opcode. 
However, some recirculate cases run the recirculated instruction differently than the original 
flow down the pipeline, and some instructions, like integer multiply and divide, require 
variable numbers of helpers. Some helper counts are determined by I/O and memory 
controllers and system devices. For example, the D-cache unit requires helpers as it 
completes an atomic memory instruction. 


Rule: Instructions requiring helpers are always break-after. 


There can be no instruction in a group that is younger than an instruction that requires 
helpers. Another way of saying this is “an instruction that requires helpers will be the 
youngest in its group.” This rule preserves the in-order execution of the integer instructions. 


Rule: Helpers block the pipeline. 


Helpers block the pipeline from executing other instructions; thus, instructions with helpers 
are blocking. 


Rule: Helpers are always single-group. 


A helper cycle is always alone in a group. No other instruction will ever be dispatched from 
the R-stage if there is a helper cycle in the R-stage. 


Integer Instructions Within a Group 


Rule: Integer instructions within a group are not allowed to write the same destination 
register. 


Disallowing two writes to the same destination register in one group simplifies the bypass 
logic and the register file write-enable determination, and avoids potential Write After Write 
(WAW) errors. The group is broken before the instruction that would write the destination 
register a second time. 


This rule applies only to integer instructions writing integer registers. Floating-point 
instructions and floating-point loads (done in the integer A0, A1, and MS pipelines) can be 
grouped so that two or more instructions in the same group can write the same floating-point 
destination register. Instruction age is associated with each instruction. The write from an 
older instruction is not visible, but the execution of the instruction might still cause a trap 
and set condition codes. 
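

The integer destination rule can be illustrated with a short sketch that breaks a candidate 
group before the second write to the same integer register. It models this one rule in 
isolation; the function name `split_on_integer_waw` and the `(name, dest_reg)` tuple layout 
are assumptions for illustration. 

```python
def split_on_integer_waw(candidates):
    """Split (name, dest_reg) pairs so no group writes an integer register twice.

    dest_reg is None for instructions that write no integer register.
    """
    groups, current, written = [], [], set()
    for name, dest in candidates:
        if dest is not None and dest in written:
            # Break before the instruction that would write dest a second time.
            groups.append(current)
            current, written = [], set()
        current.append(name)
        if dest is not None:
            written.add(dest)
    if current:
        groups.append(current)
    return groups


print(split_on_integer_waw([("add", "%o1"), ("sub", "%o2"), ("or", "%o1")]))
# [['add', 'sub'], ['or']]
```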


There are no special rules concerning integer instructions that set condition codes and 
integer branch instructions. 


Integer instructions that set condition codes can be grouped in any way with integer 
branches. In fact, any number of instructions that set condition codes are allowed in any order 
relative to the branch, provided that they do not violate any other rules. Integer instructions 
that set condition codes in the A1 and A0 pipelines can compute a taken/untaken result in the 
E-stage, which is the same stage in which the branch is evaluating the correctness of its 
prediction. The control logic guarantees that the correct condition codes are used in the 
evaluation. 


Same-Group Bypass 


Rule: Same-group bypass is disallowed, except store instructions. 


The group bypass rule states that no instruction can bypass its result to another instruction in 
the same group. The one exception to this rule is store. A store instruction can get its store 
data (rd), but not its address operands (rs1, rs2), from an instruction in the same group. 
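

A minimal sketch of the same-group bypass check for register operands follows. The 
dictionary layout for the consumer instruction and the function names are hypothetical and 
chosen only for this illustration. 

```python
def same_group_bypass_allowed(consumer_is_store, operand_role):
    """Only the store data operand (rd) of a store may take a same-group result."""
    return consumer_is_store and operand_role == "rd"


def can_share_group(producer_dest, consumer):
    """Return True if consumer may be grouped with the producer of producer_dest.

    consumer is a dict with 'is_store' (bool) and 'operands', a mapping from
    operand role ('rd', 'rs1', 'rs2') to register name.
    """
    for role, reg in consumer["operands"].items():
        if reg == producer_dest and not same_group_bypass_allowed(consumer["is_store"], role):
            return False
    return True


store_data = {"is_store": True, "operands": {"rd": "%o1", "rs1": "%o2", "rs2": "%o3"}}
store_addr = {"is_store": True, "operands": {"rd": "%o4", "rs1": "%o1", "rs2": "%o3"}}
print(can_share_group("%o1", store_data))  # True
print(can_share_group("%o1", store_addr))  # False
```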


Floating-Point Unit Operand Dependencies 


Latency and Destination Register Addresses 


Floating-point operations have longer latencies than most integer instructions. Moreover, 
floating-point square root and divide instructions have varying latencies depending on 
whether the operands are single precision or double precision. All the floating-point 
instruction latencies are four clock cycles (except for floating-point divide and square root 
and the PDIST-to-PDIST case). 


The operands for floating-point operations can either be single precision (32-bit) or double 
precision (64-bit). Sixteen of the double precision registers are each made up of two single 
precision registers. An operation using one of these double precision registers as a source 
operand may be dependent on an earlier single precision operation producing part of the 
register value. Similarly, an operation using one of the single precision registers as a source 
operand may be dependent on an earlier double precision operation, a part of which may 
produce the single precision register value. 


Grouping Rules for Floating-Point Instructions 


Rule: Floating-point divide/square root is busy. 


The floating-point divide/square root unit is a non-pipelined unit. The Integer Execution Unit 
sets a busy bit for each of the two stages of the divide/square root and depends on the FGU 
to clear them. Only the first part of the divide/square root is considered to have a busy unit; 
therefore, once the first part is complete, a new floating-point divide/square root operation 
can be started. 


Rule: Floating-point divide/square root needs a write slot in the FGM pipeline. 


In the stage in which a divide/square root is moved from the first part to the last part, 
instructions must not be issued to the FGM pipeline. This constraint provides the write slot in 
the FGM pipeline so the divide/square root can write the floating-point register file. 


Rule: Floating-point store is dependent on floating-point divide/square root. 


The floating-point divide/square root unit has a latency longer than the normal pipeline. As a 
result, if a floating-point store depends on the result of a floating-point divide/square root, 
then the floating-point store instruction may not be dispatched until the floating-point 
divide/square root instruction has completed. 


Grouping Rules for VIS Instructions 


Rule: Graphics Status Register (GSR) Write instructions are break-after. 


The SIAM, BMASK, and FALIGNADDR instructions write the GSR. The BSHUFFLE and 
FALIGNDATA instructions read the GSR in their operation. Because of the GSR write 
latency, a GSR reader cannot be in the same group as a GSR writer unless the GSR reader is 
older than the GSR writer. The simplest solution to this dependency is to make all GSR write 
instructions break-after. 





Note — The WRGSR instruction is not included in this rule as a special case. The WRGSR 
instruction is already break-after by virtue of being a WRASR instruction. 


PDIST Special Cases 


PDIST-to-dependent-PDIST is handled as a special case with one-cycle latency. PDIST 
latency to any other dependent operation is a four-cycle latency. In addition, a PDIST cannot 
be issued if there is ST, block store (BST), or partial store instruction in the M-stage of the 
pipeline. PDIST issue is delayed if there is a store type instruction two groups ahead of it. 


Grouping Rules for Register-Window Management Instructions 


Rule: Window changing instructions are single-group. 


The window changing instructions SAVE, RESTORE, and RETURN are all single-group 
instructions. These instructions are never grouped with any other instruction. This rule 
greatly simplifies the tracking of register file addresses. 





Rule: Window changing instructions force bubbles after. 


The window changing instructions SAVE, RESTORE, and RETURN also force a subsequent 
pipeline bubble. A bubble is distinct from a helper cycle in that there is nothing valid in the 
pipeline within a bubble. During the bubble, control logic transfers the new window from the 
Architectural Register File (ARF) to the Working Register File (WRF). 


Rule: FLUSHW is single-group. 


To simplify the Integer Execution Unit’s handling of the register file window flush, the 
FLUSHW instruction is single-group. 


Rule: SAVED and RESTORED are single-group. 


To simplify the Integer Execution Unit’s window tracking, SAVED and RESTORED are 
single-group instructions. 


Grouping Rules for Reads and Writes of the ASRs 


Rule: Write ASR and Write PR instructions are single-group. 


WRASR and WRPR are always the youngest instructions in a group. This case prevents 
problems with an instruction being dependent on the result of the write, which occurs late in 
the pipeline. 


Rule: Write ASR and Write PR force seven bubbles after. 


To guarantee that any instruction that starts in the R-stage is started with the most up-to-date 
status registers, WRASR and WRPR force bubbles after they are dispatched. Thus, if a WRASR 
or a WRPR instruction is in the pipeline anywhere from the E-stage to the T-stage, no 
instructions are dispatched from the R-stage (bubbles are forced in). 


Rule: Read ASR and Read PR force up to six bubbles before (break-before multicycle). 


Many instructions can update the ASRs and PRs. Therefore, if an RDASR or RDPR 
instruction is in the R-stage and any valid instruction is in the integer pipelines from the 
E-stage to the X-stage, the UltraSPARC IIIi processor does not allow the RDASR and RDPR 
instructions to be dispatched. Instead, the read waits until all instructions in the pipeline 
have written the ASRs and privileged registers, and only then is it dispatched. 


Grouping Rules for Other Instructions 


Rule: Block Load (BLD) and Block Store (BST) are single-group and multicycle. 


For simplicity in the Integer Execution Unit and memory system, BLD and BST are 
single-group instructions with helpers. 


Rule: FLUSH is single-group and forces seven bubbles after. 


To simplify the Instruction Issue Unit and Integer Execution Unit, the FLUSH instruction is 
single-group. This makes instruction cancellation and issue easier. FLUSH is held in the 
R-stage until the store queue and the pipeline from E-stage through D-stage is empty. 


Rule: MEMBAR (#Sync, #Lookaside, #StoreLoad, #MemIssue) is single-group. 


To simplify the Integer Execution Unit and memory system, MEMBAR is a single-group 
instruction. MEMBAR will not dispatch until the memory system has completed necessary 
transactions. 


Rule: Software-initiated reset (SIR) is single-group. 


For simplicity, SIR is a single-group instruction. 


Rule: Load FSR (LDFSR) is single-group and forces seven bubbles after. 


For simplicity, LDFSR is a single-group instruction. 


Rule: DONE and RETRY are single-group. 


DONE and RETRY instructions are dispatched as a single-group. 


Rule: DONE and RETRY force seven bubbles after. 


DONE and RETRY are typically used to return from traps or interrupts and are known as trap 
exit instructions. 


It takes a few cycles to properly restore the pre-trap state and the working register file from 
the architectural register file, so bubbles are forced after the trap exit instructions to provide 
the cycles to do it all. A new instruction is not accepted until the trap exit instruction leaves 
the pipeline (also known as D + 1). 


Conditional Moves 


The compiler needs to have a detailed model of the implementation of the various 
conditional moves so it can optimally schedule code. TABLE 4-3 describes the implementation 
of the five classes of SPARC-V9 conditional moves in the pipeline. FADD and ADD 
instructions (shaded rows) are also described as a reference for comparison with the 
conditional move instructions. 


TABLE 4-3 SPARC-V9 Conditional Moves 

Instruction  RD Latency  Pipelines Used  Busy Cycles  Groupable  {i,f}CC Dependency 
FMOVicc      3 cycles    FGA and BR      1            Yes        icc-0 
FMOVfcc      3 cycles    FGA and BR      1            Yes        fcc-0 
FMOVr        3 cycles    FGA and MS      1            Yes        N/A 
FADD         4 cycles    FGA 
ADD          1 cycle     A0 or A1        1            Yes        N/A 
MOVcc        2 cycles    MS and BR       1            Yes        icc-0 
MOVR         2 cycles    MS and BR       1            Yes        N/A 

Where: 


RD Latency — The number of processor cycles until the destination register is available for 
bypassing to a dependent instruction. 


Pipes Used — The pipeline that the instruction uses when it is issued. The pipelines are 
shown in TABLE 4-2. 


Busy Cycles — The number of cycles that the pipelines are not available for other 
instructions to be issued. A value of one signifies a fully pipelined instruction. 


Groupable — Whether instructions using pipelines, other than those used by the conditional 
move, can be issued in the same cycle as the conditional move. 


{i,f}CC Dependency — The number of cycles that a CC setting instruction must be 
scheduled ahead of the conditional move in order to avoid incurring pipeline stall cycles. 


Instruction Latencies and Dispatching Properties 


In this section, a machine description is given in the form of a table (TABLE 4-5 on page 50) 
dealing with dispatching properties and latencies of operations. The static or nominal 
properties are modelled in the following terms (columns in TABLE 4-5 on page 50), which are 
discussed below: 


- Latencies 
- Blocking properties in dispatching 
- Pipeline resources (A0, A1, FGA, FGM, MS, BR) 
- Break rules in grouping (before, after, single-group) 


The pipeline assumes the primary cache will be accessed. The dynamic properties, such as 
the effect of a cache miss and other conditions, are not described here. 


Latency 


In the Latency column of TABLE 4-5 on page 50, latencies are minimum cycles at which a 
dependent operation (consumer) can be dispatched, relative to the producer operation, 
without causing a dependency stall or instructions to hold back in the R-stage to execute. 


Operations like ADDcc produce two results, one in the destination register and another in the 
condition codes. For such operations, latencies are stated as a pair x,y, where x is for the 
destination register dependence and y is for the condition code. 


A zero latency implies that the producer and consumer operations may be grouped together 
in a single group, as in {SUBcc, BE %icc}. 





Operations like UMUL have different latencies, depending on operand values. These are given 
as a range, min-max, for example, 6-8 in UMUL. Operations like LDFSR involve waiting for 
a specified condition. Such cases are described by footnotes and a notation like 32+ for 
CASA (meaning at least 32 cycles). 


Cycles for branch operations (like BPcc) give the dispatching cycle of the retiring target 
operation relative to the branch. A pair of numbers, for example 0, 8, is given, depending on 
the outcome of a branch prediction, where 0 means a correct branch prediction and 8 means 
a mispredicted case. 
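

The latency notations described so far (single value, pair x,y, range min-max, and 
open-ended values such as 32+) can be read mechanically. The following sketch is one 
possible parser, assuming entries are supplied as plain strings; the function name 
`parse_latency` and the tuple representation are assumptions for illustration. Entries with 
a parenthesized alternative value are not handled and raise an error. 

```python
import re


def parse_latency(cell):
    """Parse one latency entry into a list of (min, max) tuples.

    A pair 'x,y' yields two tuples. max is None for open-ended entries
    such as '32+'. Bracketed footnote markers such as [2] are ignored.
    """
    cell = re.sub(r"\[\d+\]", "", cell)          # drop footnote markers
    parts = []
    for item in cell.split(","):
        item = item.strip()
        m = re.fullmatch(r"(\d+)(?:\s*-\s*(\d+)|(\+))?", item)
        if m is None:
            raise ValueError(f"unrecognized latency entry: {item!r}")
        low = int(m.group(1))
        if m.group(3):                           # 'n+': at least n cycles
            parts.append((low, None))
        elif m.group(2):                         # 'min-max' range
            parts.append((low, int(m.group(2))))
        else:                                    # single fixed value
            parts.append((low, low))
    return parts


print(parse_latency("1,0 [1]"))   # [(1, 1), (0, 0)]
print(parse_latency("6-8"))       # [(6, 8)]
print(parse_latency("32+"))       # [(32, None)]
```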


Special cases, such as FCMP(s,d), in which latencies depend on the type of consuming 
operations, are described in footnotes (bracketed, for example, [1]). 


Blocking 


The Blocking column of TABLE 4-5 gives the number of clock cycles that the dispatch unit 
waits before issuing another group of instructions. Operations like FDIVd (MS pipeline) 
have limited blocking property; that is, the blocking is limited to the time before another 
instruction that uses MS pipeline can be dispatched. Such cases are noted with footnotes. All 
pipelines block instruction dispatch when an instruction is targeted to them, but they are not 
ready for another instruction to be pipelined-in. 


Pipeline 


The Pipeline column of TABLE 4-5 specifies the resource usage. Operations like MOVcc 
require more than one resource, as designated by the notation MS and BR. The operation 
LDF can dispatch to either MS, A0, or A1 as indicated. 


Break and SIG 


Grouping properties are given in columns Break and SIG (single-instruction group). In the 
Break column an entry can be “Before,” meaning that this operation causes a break in a 
group so that the operation starts a new group. Operations like RDCCR require dispatching to 
be stalled until all operations in flight are completed (reach D-stage); in such cases, details 
are provided in a footnote reference in the Break column. 


Operations like ALIGNADDR must be the last in an instruction group, causing a break in the 
group of type "After." 


Certain operations are not groupable and therefore are issued in single-instruction groups. A 
break “before” and “after” are implied for non-groupable instructions. 


TABLE 4-5 UltraSPARC IIIi Processor Instruction Latencies and Dispatching Properties 

Instruction  Latency  Dispatch Blocking After  Pipeline  Break  SIG 
ADD 1 A0 or A1 
ADDcc 1, 0 [1] A0 or A1 
ADDC 5 4 MS Yes 
ADDCcc 6, 5 [2] 5 MS Yes 
ALIGNADDR 2 MS After 
ALIGNADDRL 2 MS After 
AND 1 A0 or A1 
ANDcc 1, 0 [1] A0 or A1 
ANDN 1 A0 or A1 
ANDNcc 1, 0 [1] A0 or A1 
ARRAY(8,16,32) 2 MS 
Bicc 0, 8 [3] 0, 5 [4] BR 
BMASK 2 MS After 
BPcc 0, 8 [3] 0, 5 [4] BR 
BPR 0, 8 [3] 0, 5 [4] BR and MS 
BSHUFFLE 3 FGA Yes 
CALL label 0-3 [5] BR and MS 
CASA 32+ 31+ MS After 
CASXA 32+ 31+ MS After 
DONE 7 Yes BR and MS Yes 
EDGE(8,16,32){L} 5 4 MS Yes 
EDGE(8,16,32)N 2 MS 
EDGE(8,16,32)LN 2 MS 
FABS(s,d) 3 FGA 
FADD(s,d) 4 FGA 
FALIGNDATA 3 FGA 
FANDNOT1{s} 3 FGA 
FANDNOT2{s} 3 FGA 
FAND{s} 3 FGA 
FBPfcc BR 
FBfcc BR 
FCMP(s,d) 1, 5 [6] FGA 
FCMPE(s,d) 1, 5 [6] FGA 
FCMPEQ(16,32) 4 MS and FGA 
FCMPGT(16,32) 4 MS and FGA 
FCMPLE(16,32) 4 MS and FGA 
FCMPNE(16,32) 4 MS and FGA 
FDIVd 20 (14) [6] 17 (11) [7] FGM 
FDIVs 17 (14) [6] 14 (11) [7] FGM 
FEXPAND 3 FGA 
FiTO(s,d) 4 FGA 
FLUSH 8 7 BR and MS Before [8] Yes 
FLUSHW Yes MS Yes 
FMOV(s,d) 3 FGA 
FMOV(s,d)cc 3 FGA and BR 
FMOV(s,d)r 3 FGA and MS 
FMUL(s,d) 4 FGM 
FMUL8(,SU,UL)x16 4 FGM 
FMUL8x16(AL,AU) 4 FGM 
FMULD8(SU,UL)x16 4 FGM 
FNAND{s} 3 FGA 
FNEG(s,d) 3 FGA 
FNOR{s} 3 FGA 
FNOT(1,2){s} 3 FGA 
FONE{s} 3 FGA 
FORNOT(1,2){s} 3 FGA 
FOR{s} 3 FGA 
FPACK(FIX,16,32) 4 FGM 
FPADD(16,16s,32,32s) 3 FGA 
FPMERGE 3 FGA 
FPSUB(16,16s,32,32s) 3 FGA 
FsMULd 4 FGM 
FSQRTd 29 (14) [6] 26 (11) [7] FGM 
FSQRTs 23 (14) [6] 20 (11) [7] FGM 
FSRC(1,2){s} 3 FGA 
F(s,d)TO(d,s) 4 FGA 
F(s,d)TOi 4 FGA 
F(s,d)TOx 4 FGA 
FSUB(s,d) 4 FGA 
FXNOR 3 FGA 
FXOR{s} 3 FGA 
FxTO(s,d) 4 FGA 
FZERO{s} 3 FGA 
ILLTRAP MS 
JMPL reg,%o7 0-4, 9-10 [9] 0-3, 8-9 MS and BR 
JMPL %i7+8,%g0 3-5, 10-12 [10] 2-4, 9-11 MS and BR 
JMPL %o7+8,%g0 0-4, 9 [11] 0-3, 8 MS and BR 
LDD 2 Yes MS After 
LDDA 2 Yes MS After 
LDDF{A} 3 MS, A0, or A1 
LDF{A} 3 MS, A0, or A1 
LDFSR [22] Yes MS Yes 
LDSB{A} 3 MS 
LDSH{A} 3 MS 
LDSTUB{A} 31+ 30+ MS After 
LDSW{A} 3 MS 
LDUB{A} 3 MS 
LDUH{A} 3 MS 
LDUW{A} 2 MS 
LDX{A} 2 MS 
LDXFSR [22] Yes MS Yes 
MEMBAR #LoadLoad [12] MS Yes 
MEMBAR #LoadStore [12] MS Yes 
MEMBAR #Lookaside [13] MS Yes 
MEMBAR #MemIssue [13] MS Yes 
MEMBAR #StoreLoad [13] MS Yes 
MEMBAR #StoreStore [12] MS Yes 
MEMBAR #Sync [14] MS Yes 
MOVcc 2 MS and BR 
MOVfcc 2 MS and BR 
MOVr 2 MS 
MULScc 6, 5 [2] 5 MS Yes 
MULX 6-9 5-8 MS After 
NOP na MS 
OR 1 A0 or A1 
ORcc 1, 0 [1] A0 or A1 
ORN 1 A0 or A1 
ORNcc 1, 0 [1] A0 or A1 
PDIST 4 FGM 
POPC emulated 
PREFETCH{A} MS 
RDASI 4 MS Before [15] 
RDASR 4 MS Before [15] 
RDCCR 4 MS Before [15] 
RDDCR 
RDFPRS 4 MS Before [15] 
RDPC 4 MS Before [15] 
RDPR MS Before [15] 
RDSOFTINT 
RDTICK 4 MS Before [15] 
RDY 4 MS Before [15] 
RESTORE 2 1 MS Before [16] Yes 
RESTORED MS Yes 
RETRY 2 Yes MS and BR After 
RETURN 2, 9 [17] 1, 8 MS and BR Before [18] Yes 
SAVE 2 1 MS Before [19] Yes 
SAVED 2 Yes MS Yes 
SDIV 39 38 MS After 
SDIVcc 40, 39 [2] 39 MS After 
SDIVX 71 70 MS After 
SETHI 1 A0 or A1 
SHUTDOWN [23] NOP MS NOP 
SIAM Yes MS Yes 
SIR Yes BR and MS Yes 
SLL{X} 1 A0 or A1 
SMUL 6-7 5-6 MS After 
SMULcc 7-8, 6-7 [2] 6-8 MS After 
SRA{X} 1 A0 or A1 
SRL{X} 1 A0 or A1 
STB{A} MS 
STBAR [20] MS Yes 
STD{A} 2 MS Yes 
STDF{A} MS 
STF{A} MS 
STFSR 9 MS Before [21] Yes 
ST(H,W,X){A} MS 
STXFSR 9 MS Before [21] Yes 
SUB 1 A0 or A1 
SUBcc 1, 0 [1] A0 or A1 
SUBC 5 4 MS Yes 
SUBCcc 6, 5 [2] 5 MS Yes 
SWAP{A} 31+ 30+ MS After 
TADDcc 5 Yes MS Yes 
TSUBcc 5 Yes MS Yes 
Tcc BR and MS 
UDIV 40 39 MS After 
UDIVcc 41, 40 [2] 40 MS After 
UDIVX 71 70 MS After 
UMUL 6-8 5-7 MS After 
UMULcc 7-8, 6-7 [2] 6-8 MS After 
WRASI 16 BR and MS Yes 
WRASR 7 BR and MS Yes 
WRCCR 7 BR and MS Yes 
WRFPRS 7 BR and MS Yes 
WRPR 7 BR and MS Yes 
WRY 7 BR and MS Yes 
XNOR 1 A0 or A1 
XNORcc 1, 0 [1] A0 or A1 
XOR 1 A0 or A1 
XORcc 1, 0 [1] A0 or A1 

1. These operations produce two results: destination register and condition code ($icc, $xcc). The latency is one in the 
former case and zero in the latter case. For example, SUBcc and BE $icc are grouped together (zero latency). 


2. These operations produce two results: destination register and condition code ($icc, $xcc). The latency is given as a 
pair of numbers —m, n — for the register and condition code, respectively. When latencies vary in a range, such as in 
UMULcc, this range is indicated by pair— pair. 


3.  Latency is x, y for correct, incorrect branch prediction. It is measured as the difference in the dispatching cycle of the 
retiring target instruction and that of the branch. 
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Blocking cycles are x,y for correct, incorrect branch prediction. They are measured as the difference in the dispatching 
cycle of instruction in the delay slot (or target, if annulled) that retires and that of the branch. 


Native Call and Link with immediate target address (label). 


Latency in parentheses applies when operands involve IEEE special values (NaN, INF), including zero and illegal 
values. 


Blocking is limited to another FD operation in succession; otherwise, it is unblocking. Blocking cycles in parentheses 
apply when operands involve special holding and illegal values. 


Dispatching stall (7+ cycles) until all stores in flight retire. 
0-4 if predicted true; 9-10 if mispredicted. 


Latency is taken to be the difference in dispatching cycles from jmp1 to target operation, including the effect of an 
operation in the delay slot. Blocking cycles thus may include cycles due to restore in the delay slot. In a given pair x,y, 
x applies when predicted correctly and y when predicted incorrectly. Each x or y may be a range of values. 


0-4 if predicted true; 9 if mispredicted. 

This MEMBAR has NOP semantics, since the ordering specified is implicitly done by processor (memory model is TSO). 
All operations in flight complete as in MEMBAR #Sync. 

All operations in flight complete. 

Issue stalls a minimum of 7 cycles until all operations in flight are done (get to D-stage). 
Dispatching stalls until previous save in flight, if any, reaches D-stage. 

2 if predicted correctly, 9 otherwise. Similarly for blocking cycles. 

Dispatching stalls until previous restore in flight, if any, reaches D-stage. 

Dispatching stalls until previous restore in flight, if any, reaches D-stage. 

Same as MEMBAR #StoreStore, which is NOP. 

Dispatching stalls until all FP operations in flight are done. 

Wait for completion of all FP operations in flight. 


The Shutdown instruction is not implemented. The instruction is neutralized and appears as a NOP to software (no 
visible effects). 
CHAPTER 5 





Data Formats 





The processor recognizes the following fundamental data types: 
Signed integer: 8, 16, 32, and 64 bits 
Unsigned integer: 8, 16, 32, and 64 bits 
VIS Instruction data formats: pixel (32 bits), fixed16 (64 bits), and fixed32 (64 bits) 
Floating-point: 32, 64, and 128 bits 


The widths of the data types are as follows: 
Byte: 8 bits 
Halfword: 16 bits 
Word: 32 bits 
Tagged word: 32 bits (30-bit value plus 2-bit tag; deprecated) 
Doubleword: 64 bits (deprecated in favor of Extended word) 
Extended word: 64 bits 
Quadword: 128 bits 

The signed integer values are stored as two's-complement numbers with a width 
commensurate with their range. In tagged words, the least significant two bits are treated as 
a tag; the remaining 30 bits are treated as a signed integer. 
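
The tagged-word split described above can be sketched in a few lines. This is an illustrative sketch only: the function name split_tagged_word is hypothetical, and the code assumes the word is presented as an unsigned 32-bit integer.

```python
def split_tagged_word(word):
    """Split a 32-bit tagged word into (value, tag).

    Assumption: `word` is the raw 32-bit pattern as an unsigned integer.
    """
    word &= 0xFFFFFFFF
    tag = word & 0x3            # least significant two bits are the tag
    value = word >> 2           # remaining 30 bits, still unsigned here
    if value & (1 << 29):       # bit 29 is the sign bit of the 30-bit field
        value -= 1 << 30        # two's-complement sign extension
    return value, tag


print(split_tagged_word(0x00000016))  # value 5, tag 2
print(split_tagged_word(0xFFFFFFFD))  # value -1, tag 1
```

The 30-bit value is sign-extended by hand because Python integers have no fixed width.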


Names are assigned to individual subwords of the multiword data formats as described in the 
following sections: 

Signed Integer Double 

Unsigned Integer Double 

Floating-Point, Double-Precision 


Floating-Point, Quad-Precision 
5.1 Integer Data Formats 


The processor supports the following integer data formats: 
Signed integer 
Unsigned integer 
Tagged integer word 


5.1.1 Integer Data Value Range 


TABLE 5-1 describes the width and ranges of the signed, unsigned, and tagged integer data 
formats. 


TABLE 5-1 Signed Integer, Unsigned Integer, and Tagged Integer Format Ranges 

Data Type                        Width (bits)  Range Lower  Range Upper 
Signed integer byte              8             -2^7         2^7 - 1 
Signed integer halfword          16            -2^15        2^15 - 1 
Signed integer word              32            -2^31        2^31 - 1 
Signed integer tagged word       32            -2^29        2^29 - 1 
Signed integer double word       64            -2^63        2^63 - 1 
Signed extended integer          64            -2^63        2^63 - 1 
Unsigned integer byte            8             0            2^8 - 1 
Unsigned integer halfword        16            0            2^16 - 1 
Unsigned integer word            32            0            2^32 - 1 
Unsigned integer tagged word     32            0            2^30 - 1 
Unsigned integer double word     64            0            2^64 - 1 
Unsigned extended integer        64            0            2^64 - 1 
5.1.2 Integer Data Alignment 


TABLE 5-2 describes the memory and register alignment for integer data. 


TABLE 5-2 Integer Data Alignment 

Type Width         Subformat  Subformat Field                Required Address  Memory Address  Register Number  Register 
                   Name                                      Alignment         (Big-endian)    Alignment        Number 
B (byte)           SB         signed_byte_integer<7:0>       None              n               Any              r 
B (byte)           UB         unsigned_byte_integer<7:0>     None              n               Any              r 
H (halfword)       SH         signed_halfwd_integer<15:0>    0 mod 2           n               Any              r 
H (halfword)       UH         unsigned_halfwd_integer<15:0>  0 mod 2           n               Any              r 
W (word)           SW         signed_word_integer<31:0>      0 mod 4           n               Any              r 
W (word)           UW         unsigned_word_integer<31:0>    0 mod 4           n               Any              r 
D (double word)    SD-0       signed_dbl_integer<63:32>      0 mod 8           n               0 mod 2          r 
D (double word)    UD-0       unsigned_dbl_integer<63:32>    0 mod 8           n               0 mod 2          r 
D (double word)    SD-1       signed_dbl_integer<31:0>       4 mod 8           n + 4           1 mod 2          r + 1 
D (double word)    UD-1       unsigned_dbl_integer<31:0>     4 mod 8           n + 4           1 mod 2          r + 1 
X (extended word)  SX         signed_ext_integer<63:0>       0 mod 8           n               -                r 
X (extended word)  UX         unsigned_ext_integer<63:0>     0 mod 8           n               -                r 
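
The alignment rules in TABLE 5-2 can be expressed as a small check. The sketch below is illustrative only; the names REQUIRED_ALIGNMENT, is_aligned, and doubleword_layout are hypothetical, and big-endian memory ordering is assumed as in the table.

```python
# Required address alignment in bytes, keyed by the type-width letter of TABLE 5-2.
REQUIRED_ALIGNMENT = {"B": 1, "H": 2, "W": 4, "D": 8, "X": 8}


def is_aligned(address, type_width):
    """Return True if `address` meets the required alignment for the type."""
    return address % REQUIRED_ALIGNMENT[type_width] == 0


def doubleword_layout(n, r):
    """Return [(address, register)] for the D-0 and D-1 subformats.

    The upper word lives at address n in the even register r; the lower
    word lives at n + 4 in register r + 1 (big-endian ordering assumed).
    """
    if n % 8 != 0:
        raise ValueError("doubleword address must be 0 mod 8")
    if r % 2 != 0:
        raise ValueError("doubleword register number must be 0 mod 2")
    return [(n, r), (n + 4, r + 1)]


print(is_aligned(0x1002, "H"), is_aligned(0x1002, "W"))
print(doubleword_layout(0x1000, 2))
```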
The data types are illustrated in the following subsections. 


5.1.3 Signed Integer Data Types 


Figures in this section illustrate the following signed data types: 
Signed integer byte 
Signed integer halfword 
Signed integer word 
Signed integer doubleword 
Signed extended integer 


5.1.3.1 Signed Integer Byte 


FIGURE 5-1 illustrates the signed integer byte data format. 


SB: S (bit 7) | integer value (bits 6:0) 


FIGURE 5-1 Signed Integer Byte Data Format 


5.1.3.2 Signed Integer Halfword 


FIGURE 5-2 illustrates the signed integer halfword data format. 


SH: S (bit 15) | integer value (bits 14:0) 


FIGURE 5-2 Signed Integer Halfword Data Format 





5.1.3.3 Signed Integer Word 


FIGURE 5-3 illustrates the signed integer word data format. 


SW: S (bit 31) | integer value (bits 30:0) 


FIGURE 5-3 Signed Integer Word Data Format 


5.1.3.4 Signed Integer Double 


FIGURE 5-4 illustrates both components (SD-0 and SD-1) of the signed integer double data 
format. 


SD-0: S (bit 31) | signed_dbl_integer<62:32> (bits 30:0) 
SD-1: signed_dbl_integer<31:0> (bits 31:0) 


FIGURE 5-4 Signed Integer Double Data Format 
5.1.3.5 Signed Extended Integer 


FIGURE 5-5 illustrates the signed extended integer (SX) data format. 


SX: S (bit 63) | integer value (bits 62:0) 


FIGURE 5-5 Signed Extended Integer Data Format 


5.1.4 Unsigned Integer Data Types 


Figures in this section illustrate the following unsigned data types: 
Unsigned integer byte 
Unsigned integer halfword 
Unsigned integer word 
Unsigned integer doubleword 
Unsigned extended integer 


5.1.4.1 Unsigned Integer Byte 


FIGURE 5-6 illustrates the unsigned integer byte data format. 


UB: integer value (bits 7:0) 


FIGURE 5-6 Unsigned Integer Byte Data Format 


5.1.4.2 Unsigned Integer Halfword 


FIGURE 5-7 illustrates the unsigned integer halfword data format. 


UH: integer value (bits 15:0) 


FIGURE 5-7 Unsigned Integer Halfword Data Format 


5.1.4.3 Unsigned Integer Word 


FIGURE 5-8 illustrates the unsigned integer word data format. 


UW: integer value (bits 31:0) 


FIGURE 5-8 Unsigned Integer Word Data Format 


5.1.4.4 Unsigned Integer Double 


FIGURE 5-9 illustrates both components (UD-0 and UD-1) of the unsigned integer double data 
format. 


UD-0: unsigned_dbl_integer<63:32> (bits 31:0) 
UD-1: unsigned_dbl_integer<31:0> (bits 31:0) 


FIGURE 5-9 Unsigned Integer Double Data Format 


5.1.4.5 Unsigned Extended Integer 


FIGURE 5-10 illustrates the unsigned extended integer (UX) data format. 


UX: unsigned_ext_integer (bits 63:0) 


FIGURE 5-10 Unsigned Extended Integer Data Format 


5.1.5 Tagged Word 
The tagged word data format is similar to the unsigned word format except for a 2-bit tag field 
in the two LSB positions. Bit 31 is the overflow bit. 


FIGURE 5-11 illustrates the tagged word data format. 


TW: of (bit 31) | integer value (bits 30:2) | tag (bits 1:0) 


FIGURE 5-11 Tagged Word Data Format 





5.2 Floating-Point Data Formats 


Single-precision, double-precision, and quad-precision floating-point data types are described 
below. 


Single-precision floating-point (32-bit) 
Double-precision floating-point (64-bit) 
Quad-precision floating-point (128-bit) 


5.2.1 Floating-Point Data Value Range 


The value range for each format is included with the format and description of each format. 


5.2.2 Floating-Point Data Alignment 


TABLE 5-3 describes the address and memory alignment for floating-point data. 


TABLE 5-3 Floating-Point Doubleword and Quadword Alignment 

Subformat  Subformat Field                Required Address  Memory Address  Register Number  Available 
Name                                      Alignment         (Big-endian)*   Alignment        Registers 
FS         s:exp<7:0>:fraction<22:0>      0 mod 4           n               Any              f0, f1, ... f31 
FD-0       s:exp<10:0>:fraction<51:32>    0 mod 4 †         n               0 mod 2          f0, f2, ... f62 
FD-1       fraction<31:0>                 0 mod 4 †         n + 4           1 mod 2          f1, f3, ... f63 
FX-0       -                              0 mod 4 †         n               0 mod 4          f0, f4, ... f60 
FX-1       -                              0 mod 4 †         n               0 mod 4          f2, f6, ... f62 
FQ-0       s:exp<14:0>:fraction<111:96>   0 mod 4 ‡         n               0 mod 4          f0, f4, ... f60 
FQ-1       fraction<95:64>                0 mod 4 ‡         n + 4           1 mod 4          f1, f5, ... f61 
FQ-2       fraction<63:32>                0 mod 4 ‡         n + 8           2 mod 4          f2, f6, ... f62 
FQ-3       fraction<31:0>                 0 mod 4 ‡         n + 12          3 mod 4          f3, f7, ... f63 
FX         -                              0 mod 4 †         n               0 mod 4 

* The Memory Address in this table applies to big-endian memory accesses. Word and byte order are reversed when little-endian accesses are used. 

† Although a floating-point doubleword is required only to be word-aligned in memory, it is recommended that it be doubleword-aligned (that is, the 
address of its FD-0 word should be 0 mod 8 so that it can be accessed with doubleword loads/stores instead of multiple single word loads/stores). 

‡ Although a floating-point quadword is required only to be word-aligned in memory, it is recommended that it be quadword-aligned (that is, the 
address of its FQ-0 word should be 0 mod 16). 
5.2.3 Floating-Point, Single-Precision 


FIGURE 5-12 illustrates the floating-point single-precision data format, and TABLE 5-4 
describes the formats. 


FS: s (bit 31) | exp<7:0> (bits 30:23) | fraction<22:0> (bits 22:0) 


FIGURE 5-12 Floating-Point Single-Precision Data Format 


TABLE 5-4 Floating-Point Single-Precision Format Definitions 

s = sign (1 bit) 
e = biased exponent (8 bits) 
f = fraction (23 bits) 
u = undefined 

Normalized value (0 < e < 255)   (-1)^s × 2^(e-127) × 1.f 
Subnormal value (e = 0)          (-1)^s × 2^(-126) × 0.f 
Zero (e = 0)                     (-1)^s × 0 
Signalling NaN                   s = u; e = 255 (max); f = .0uu--uu 
                                 (At least one bit of the fraction must be nonzero) 
Quiet NaN                        s = u; e = 255 (max); f = .1uu--uu 
-∞ (negative infinity)           s = 1; e = 255 (max); f = .000--00 
+∞ (positive infinity)           s = 0; e = 255 (max); f = .000--00 
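
The single-precision definitions can be applied mechanically to a 32-bit word. The following is a minimal sketch, assuming the word is given as an unsigned integer; the function name decode_single and the returned labels are hypothetical and chosen only for illustration.

```python
import math


def decode_single(word):
    """Classify a 32-bit single-precision pattern and return (kind, value)."""
    s = (word >> 31) & 0x1          # sign, bit 31
    e = (word >> 23) & 0xFF         # biased exponent, bits 30:23
    f = word & 0x7FFFFF             # fraction, bits 22:0
    sign = -1.0 if s else 1.0
    if e == 255:
        if f == 0:
            return "infinity", sign * math.inf
        # The most significant fraction bit separates quiet from signalling NaNs.
        kind = "quiet NaN" if f & (1 << 22) else "signalling NaN"
        return kind, math.nan
    if e == 0:
        if f == 0:
            return "zero", sign * 0.0
        return "subnormal", sign * 2.0 ** -126 * (f / 2.0 ** 23)
    return "normalized", sign * 2.0 ** (e - 127) * (1 + f / 2.0 ** 23)


print(decode_single(0x3FC00000))
print(decode_single(0xFF800000))
```

Python floats are double precision, so every single-precision value is represented exactly in the result.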
5.2.4 Floating-Point, Double-Precision 


FIGURE 5-13 illustrates both components (FD-0 and FD-1) of the floating-point 
double-precision data format when two 32-bit registers are used. FIGURE 5-14 illustrates a 
double-precision data format using one 64-bit register. 


TABLE 5-5 describes the data formats. 


FD-0: s (bit 31) | exp<10:0> (bits 30:20) | fraction<51:32> (bits 19:0) 
FD-1: fraction<31:0> (bits 31:0) 


FIGURE 5-13 Floating-Point Double-Precision Double Word Data Format 


FX: s (bit 63) | exp<10:0> (bits 62:52) | fraction<51:0> (bits 51:0) 


FIGURE 5-14 Floating-Point Double-Precision Extended Word Data Format 


TABLE 5-5 Floating-Point Double-Precision Format Definition 

s = sign (1 bit) 
e = biased exponent (11 bits) 
f = fraction (52 bits) 
u = undefined 

Normalized value (0 < e < 2047)  (-1)^s × 2^(e-1023) × 1.f 
Subnormal value (e = 0)          (-1)^s × 2^(-1022) × 0.f 
Zero (e = 0)                     (-1)^s × 0 
Signalling NaN                   s = u; e = 2047 (max); f = .0uu--uu 
                                 (At least one bit of the fraction must be nonzero) 
Quiet NaN                        s = u; e = 2047 (max); f = .1uu--uu 
-∞ (negative infinity)           s = 1; e = 2047 (max); f = .000--00 
+∞ (positive infinity)           s = 0; e = 2047 (max); f = .000--00 
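
When a double-precision value is held as two 32-bit words, the FD-0 word carries the sign, exponent, and upper fraction bits, and the FD-1 word carries the lower fraction bits. The sketch below recombines them; the function name decode_double is hypothetical, and both words are assumed to be unsigned 32-bit integers.

```python
import math


def decode_double(fd0, fd1):
    """Return the value encoded by the FD-0 and FD-1 words."""
    s = (fd0 >> 31) & 0x1                            # sign, FD-0 bit 31
    e = (fd0 >> 20) & 0x7FF                          # exponent, FD-0 bits 30:20
    f = ((fd0 & 0xFFFFF) << 32) | (fd1 & 0xFFFFFFFF)  # fraction<51:0>
    sign = -1.0 if s else 1.0
    if e == 2047:
        return sign * math.inf if f == 0 else math.nan
    if e == 0:
        # Covers zero (f == 0) and subnormal values.
        return sign * 2.0 ** -1022 * (f / 2.0 ** 52)
    return sign * 2.0 ** (e - 1023) * (1 + f / 2.0 ** 52)


print(decode_double(0x3FF80000, 0x00000000))
```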













5.2.5 Floating-Point, Quad-Precision 


FIGURE 5-15 illustrates all four components (FQ-0 through FQ-3) of the floating-point 
quad-precision data format, and TABLE 5-6 describes the formats. 


Compatibility Note: Floating-point quad is not implemented in the processor. 
Quad-precision operations are emulated in the OS kernel. 


FQ-0: s (bit 31) | exp<14:0> (bits 30:16) | fraction<111:96> (bits 15:0) 
FQ-1: fraction<95:64> (bits 31:0) 
FQ-2: fraction<63:32> (bits 31:0) 
FQ-3: fraction<31:0> (bits 31:0) 


FIGURE 5-15 Floating-Point Quad-Precision Data Format 


TABLE 5-6 Floating-Point Quad-Precision Format Definitions 

s = sign (1 bit) 
e = biased exponent (15 bits) 
f = fraction (112 bits) 
u = undefined 

Normalized value (0 < e < 32767)  (-1)^s × 2^(e-16383) × 1.f 
Subnormal value (e = 0)           (-1)^s × 2^(-16382) × 0.f 
Zero (e = 0)                      (-1)^s × 0 
Signalling NaN                    s = u; e = 32767 (max); f = .0uu--uu 
                                  (At least one bit of the fraction must be nonzero.) 
Quiet NaN                         s = u; e = 32767 (max); f = .1uu--uu 
-∞ (negative infinity)            s = 1; e = 32767 (max); f = .000--00 
+∞ (positive infinity)            s = 0; e = 32767 (max); f = .000--00 
5.3 VIS Execution Unit Data Formats 


VIS instructions are optimized for short integer arithmetic, where the overhead of converting 
to and from floating point is significant. Data components can be 8 or 16 bits; intermediate 
results are 16 or 32 bits. 


There are two VIS data formats: 
Pixel Data 
Fixed-point Data 


Data Conversions 


Conversion from pixel data to fixed data occurs through pixel multiplications. Conversion 
from fixed data to pixel data is done with the pack instructions, which clip and truncate to an 
8-bit unsigned value. Conversion from 32-bit fixed to 16-bit fixed is also supported with the 
FPACKFIX instruction. 


Rounding 


Rounding can be performed by adding one to the round bit position. Complex calculations 
needing more dynamic range or precision should be performed using floating-point data. 


Range 


The range of values that each format supports is described below. 


Data Alignment 


The data in memory is expected to be aligned according to TABLE 5-7. If the address does not 
properly align, then an exception is generated and the load/store operation fails. 
TABLE 5-7 Pixel, Fixed16, and Fixed32 Data Alignment 

VIS Data Format  Width  VIS Data Format Name  Required Address  Memory Address  Register Number  Register 
Type                                          Alignment         (big-endian)    Alignment        Number 
Pixel            8      Pixel Data Format     0 mod 4           n               r                r 
Fixed16                 Fixed16 Data Format   0 mod 8           n               0 mod 2          r 
Fixed32                 Fixed32 Data Format   0 mod 8           n               0 mod 2          r 


5.3.1 Pixel Data Format 


The fixed 8-bit data format consists of four unsigned 8-bit integers contained in a 32-bit 
word (see FIGURE 5-16). 


One common use is to represent intensity values for the color components of an image. For 
example, R, G, B, and α are used as color components and are positioned as shown: 


Four 8-bit components in bits 31:24 | 23:16 | 15:8 | 7:0 
FIGURE 5-16 Pixel Data Format with Band Sequential Ordering Shown 


The fixed 8-bit data format can represent two types of pixel data: 


Band interleaved images, with the various color components of a point in the image 
stored together 


Band sequential images, with all of the values for one color component stored together 


5.3.2 Fixed-Point Data Formats 


The fixed 16-bit data format consists of four 16-bit signed fixed-point values contained in a 
64-bit word. The fixed 32-bit format consists of two 32-bit signed fixed-point values 
contained in a 64-bit word. Fixed-point data values provide an intermediate format with 
enough precision and dynamic range for filtering and simple image computations on pixel 
values. 
5.3.2.1 Fixed16 Data Format 


Fixed data values provide an intermediate format with enough precision and dynamic range 
for filtering and simple image computations on pixel values. 


Perform rounding by adding one to the round bit position. Perform complex calculations 
needing more dynamic range or precision by means of floating-point data. 


The fixed 16-bit data format consists of four 16-bit, signed, fixed-point values contained in a 
64-bit word. FIGURE 5-17 illustrates the Fixed16 VIS data format. 


integer | fraction integer | fraction integer | fraction integer | fraction 
63 48 47 32 31 16 15 0 





FIGURE 5-17 Fixed16 VIS Data Format 


5.3.2.2 Fixed32 Data Format 


The fixed 32-bit format consists of two 32-bit, signed, fixed-point values contained in a 
64-bit word. FIGURE 5-18 illustrates the Fixed32 VIS data format. 





integer fraction integer fraction 
63 32 31 0 


FIGURE 5-18 Fixed32 VIS Data Format 
CHAPTER 6 


Registers 


The topics covered in this chapter are discussed in the following sections: 
Section 6.1, “Introduction” 
Section 6.2, “Integer Unit General-Purpose r Registers” 
Section 6.3, “Register Window Management” 
Section 6.4, “Floating-Point General-Purpose Registers” 
Section 6.5, “Control and Status Register Summary” 
Section 6.6, “State Registers” 
Section 6.7, “Ancillary State Registers: ASRs 16-25” 
Section 6.8, “Privileged Registers” 
Section 6.9, “Special Access Register” 
Section 6.10, “ASI Mapped Registers” 


6.1 Introduction 


The processor contains many types of registers that serve various purposes and are accessed in 
many different ways. 


There are separate working registers for the integer and floating-point units (FPUs). Both of 
these register sets have been expanded over the evolution of the SPARC processor. The 
integer unit registers are shadowed using windowing and selection methods. The registers in 
the floating-point register set (also used for VIS and block load/store instructions) are 
combined in specific ways to support data sizes up to 128 bits. All integer registers and the 
upper floating-point registers are 64 bits wide. 
The processor also has a vast array of control, status, state, and diagnostic registers that are 
used to set up, control, and operate the processor. The two main operating modes of the 
processor, privileged and non-privileged mode, have a profound effect on which of the 
control and status registers are available to the software. 


The majority of the control and status registers are 64 bits wide and are accessed using the 
privileged register access instructions, state register access instructions, and load/store with 
ASI access instructions. For convenience, some registers in this chapter are illustrated as 
fewer than 64 bits wide. Any bits not shown are reserved for future extensions to the 
architecture. Such reserved bits read as zeroes and, when written by software, should be 
written with the values of those bits previously read from that register or with zeroes. 


Integer Unit Working Registers (includes r and global) 
Floating-point Unit Working Registers 
Privileged Registers 
State and Ancillary State Registers (includes ASRs) 
Floating-point Status Register (FSR) 
ASI Mapped Registers (CSRs) 


Some of the figures and tables in this chapter are reproduced from The SPARC Architecture 
Manual, Version 9 and other sources. Many diagrams and tables appear here for the first 
time. 


6.1.1 Document Notes 


Contents of this chapter apply to non-privileged mode unless stated otherwise. 





6.2 Integer Unit General-Purpose r Registers 


An UltraSPARC IIIi processor contains 160 general-purpose 64-bit r registers. They are 
windowed into 32 registers addressable by integer unit instructions. 


The r registers are partitioned into eight addressable global registers and 24 addressable 
windowed registers. There are four global register sets: normal, MMU, interrupt, and 
alternate. The windowed registers point to eight working register sets that are windowed into 
r[8] to r[31], as one full register set (eight locals and eight ins) and a half register set 
(eight outs) belonging to the next higher state. 


In summary, the r registers consist of eight in registers, eight local registers, eight out 
registers, and the selected eight global registers. 
The current window pointer (CWP) register selects the in/local/out windowed registers. 
SAVE and RESTORE instructions modify the CWP register. 


The PSTATE.AG, PSTATE.IG, and PSTATE.MG fields select the global register set. 
Processor exceptions modify the PSTATE register fields to select the global register set. 


PSTATE and CWP registers are accessible using privileged instructions. 


At any moment, general-purpose registers appear in non-privileged mode as shown in 
TABLE 6-1. 


TABLE 6-1 Integer Unit General-Purpose Registers 

Windowed        r Register 
Register Name   Address     Source 
in[7]           r[31]       Current Register Set 
in[6]           r[30]       Current Register Set 
in[5]           r[29]       Current Register Set 
in[4]           r[28]       Current Register Set 
in[3]           r[27]       Current Register Set 
in[2]           r[26]       Current Register Set 
in[1]           r[25]       Current Register Set 
in[0]           r[24]       Current Register Set 
local[7]        r[23]       Current Register Set 
local[6]        r[22]       Current Register Set 
local[5]        r[21]       Current Register Set 
local[4]        r[20]       Current Register Set 
local[3]        r[19]       Current Register Set 
local[2]        r[18]       Current Register Set 
local[1]        r[17]       Current Register Set 
local[0]        r[16]       Current Register Set 
out[7]          r[15]       Next higher level Register Set (see footnote 1) 
out[6]          r[14]       Next higher level Register Set 
out[5]          r[13]       Next higher level Register Set 
out[4]          r[12]       Next higher level Register Set 
out[3]          r[11]       Next higher level Register Set 
out[2]          r[10]       Next higher level Register Set 
out[1]          r[9]        Next higher level Register Set 
out[0]          r[8]        Next higher level Register Set 
global[7]       r[7]        Global[7] 
global[6]       r[6]        Global[6] 
global[5]       r[5]        Global[5] 
global[4]       r[4]        Global[4] 
global[3]       r[3]        Global[3] 
global[2]       r[2]        Global[2] 
global[1]       r[1]        Global[1] 
global[0]       r[0]        Global[0] (value of r[0] is always 0) 

1. The CALL instruction writes its own address into the r[15] register (out[7]). 
6.2.1 Windowed (in/local/out) r Registers 


At any time, an integer unit instruction can access a 24-register window into the register sets. 
A register window comprises the eight in and eight local registers (a complete register 
set) together with the eight in registers of the next higher register set, which are addressable 
from the current window as its out registers. 


6.2.1.1 Predefined r Register Usages 


Two of the r registers have a specific usage: 
The value of r[0] is always zero; writes to it have no program-visible effect. 
The CALL instruction writes its own address into register r[15] (out register 7). 


6.2.1.2 128-bit Operand Considerations 


LDD, LDDA, STD, and STDA instructions access 128-bit data associated with adjacent 
r registers and require even-odd register alignment. An attempt to execute an LDD, LDDA, 
STD, or STDA instruction that refers to a misaligned (odd) destination register number causes 
an illegal_instruction trap. 


6.2.2 Global r Register Sets 


Registers r[0]-r[7] refer to a set of eight global registers (g0-g7). At any time, one of 
the four sets of eight global registers is selected and can be accessed as the current global 
register set. The currently enabled set of global registers is selected by the Alternate Global 
(AG), Interrupt Global (IG), and MMU Global (MG) fields in the PSTATE register. See 
Section 6.8.3 “Processor State (PSTATE) Privileged Register 6” on page 6-107 for a 
description of the AG, IG, and MG fields. 


Global register zero (g0) always reads as zero; writes to it have no program-visible effect. 


FIGURE 6-1 illustrates the current IU registers. 
FIGURE 6-1 Three Overlapping Windows and the Eight Global Registers 


Compatibility Note: Since the PSTATE register is writable only by privileged software, 
existing non-privileged SPARC-V8 software operates correctly on a processor if supervisor 
software ensures that user software sees a consistent set of global registers. 
In summary, the processor has eight windows or register sets (NWINDOWS = 8). The total 
number of r registers in the processor is 160: eight normal global registers, eight alternate 
global registers, eight interrupt global registers, eight MMU global registers, plus the number 
of register sets (eight) times 16 registers/set. 


6.2.2.1 Overlapping Windows 


Each window shares its ins with one adjacent window and its outs with another. The outs of 
the CWP - 1 (modulo NWINDOWS) window are addressable as the ins of the current window, 
and the outs in the current window are the ins of the CWP + 1 (modulo NWINDOWS) window. 
The locals are unique to each window. 


An out register with address o, where 8 ≤ o ≤ 15, refers to exactly the same register as 
(o + 16) does after the CWP is incremented by one (modulo NWINDOWS). Likewise, an in 
register with address i, where 24 ≤ i ≤ 31, refers to exactly the same register as address 
(i - 16) does after the CWP is decremented by one (modulo NWINDOWS). See FIGURE 6-1 and 
FIGURE 6-2 for additional information. 


Since CWP arithmetic is performed modulo NWINDOWS, the highest-numbered implemented 
window (window 7) overlaps with window 0. The outs of window NWINDOWS - 1 are the ins 
of window 0. Implemented windows are numbered contiguously from 0 through 
NWINDOWS - 1. 
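
The register count and the overlap rule can be checked with a short model. This is only a sketch: the flat indexing used by physical_register is a hypothetical layout chosen for illustration, not a description of the hardware register file.

```python
NWINDOWS = 8
GLOBAL_SETS = 4   # normal, alternate, interrupt, and MMU globals


def total_r_registers():
    """Four sets of eight globals plus 16 registers per register set."""
    return GLOBAL_SETS * 8 + NWINDOWS * 16


def physical_register(cwp, r):
    """Map windowed address r (8..31) in window cwp to a flat index.

    Hypothetical layout: register set w holds its eight ins at
    w*16 .. w*16+7 and its eight locals at w*16+8 .. w*16+15.
    The outs of window w are the ins of window (w + 1) mod NWINDOWS.
    """
    if not 8 <= r <= 31:
        raise ValueError("windowed registers are r[8] through r[31]")
    if r >= 24:                                    # ins
        return cwp * 16 + (r - 24)
    if r >= 16:                                    # locals
        return cwp * 16 + 8 + (r - 16)
    return ((cwp + 1) % NWINDOWS) * 16 + (r - 8)   # outs


print(total_r_registers())
# An out at address o is the same register as in (o + 16) after CWP + 1.
print(physical_register(0, 8) == physical_register(1, 24))
```

The modulo in the outs case models the wraparound between window 7 and window 0.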





6.3 Register Window Management 


The current window in the windowed portion of r registers is given by the CWP register. The 
CWP is decremented by the RESTORE instruction and incremented by the SAVE instruction. 
Window overflow is detected by the CANSAVE register, and window underflow is detected by 
the CANRESTORE register, both of which are controlled by privileged software. A window 
overflow (underflow) condition causes a window spill (fill) trap. 
Programming Note: Because the windows overlap, the number of windows available to 
software is one less than the number of implemented windows; that is, 7 (NWINDOWS - 1). 


State shown in the figure: CWP = 0 (Current Window Pointer), CANSAVE = 4, 
CANRESTORE = 1, OTHERWIN = 1 


CANSAVE + CANRESTORE + OTHERWIN = NWINDOWS - 2 


The current window (window 0) and the overlap window (window 
5) account for the two windows in the right side of the equation. 
The “overlap window” is the window that must remain unused 
because its ins and outs overlap two other valid windows. 


NWINDOWS = 8, CWP = 0, CANSAVE = 4, OTHERWIN = 1, and 
CANRESTORE = 1. If the procedure using window w0 executes a 
RESTORE, then window w7 becomes the current window. If the 
procedure using window w0 executes a SAVE, then window w1 
becomes the current window. 


FIGURE 6-2 Windowed r Registers for NWINDOWS = 8 
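
The state shown in FIGURE 6-2 can be modeled with a few counters. The sketch below assumes the SPARC V9 behavior in which SAVE traps when CANSAVE is zero and RESTORE traps when CANRESTORE is zero; clean-window handling is omitted, and the class and exception names are hypothetical.

```python
NWINDOWS = 8


class WindowTrap(Exception):
    """Stands in for a window spill or fill trap (hypothetical name)."""


class WindowState:
    def __init__(self, cwp=0, cansave=4, canrestore=1, otherwin=1):
        self.cwp = cwp
        self.cansave = cansave
        self.canrestore = canrestore
        self.otherwin = otherwin

    def invariant_holds(self):
        # CANSAVE + CANRESTORE + OTHERWIN = NWINDOWS - 2
        total = self.cansave + self.canrestore + self.otherwin
        return total == NWINDOWS - 2

    def save(self):
        if self.cansave == 0:
            raise WindowTrap("window overflow: spill trap")
        self.cwp = (self.cwp + 1) % NWINDOWS
        self.cansave -= 1
        self.canrestore += 1

    def restore(self):
        if self.canrestore == 0:
            raise WindowTrap("window underflow: fill trap")
        self.cwp = (self.cwp - 1) % NWINDOWS
        self.cansave += 1
        self.canrestore -= 1


state = WindowState()
state.restore()
print(state.cwp, state.invariant_holds())
```

In this model a second RESTORE from the same starting state raises WindowTrap, which corresponds to the fill trap described in the text.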
6.3.1 CALL and JMPL Instructions 


Programming Note: Since the procedure call instructions (CALL and JMPL) do not 
change the CWP, a procedure can be called without changing the window. 


6.3.2 Circular Windowing 


Programming Note: When the register file is full, the outs of the newest window are the 
ins of the oldest window, which still contains valid data. 


6.3.3 Clean Window with RESTORE and SAVE Instructions 


Programming Note: The local and out registers of a register window are guaranteed 
to contain either zeroes or an old value that belongs to the current context upon reentering 
the window through a SAVE instruction. If a program executes a RESTORE followed by a 
SAVE, then the resulting window's locals and outs may not be valid after the SAVE, since a 
trap may have occurred between the RESTORE and the SAVE. 
6.4 Floating-Point General-Purpose Registers 


The floating-point register file contains addressable registers for the following: 
Floating-point instructions 
VIS instructions 
Block load and store instructions 
FSR load and store instructions 


The registers have various widths and assigned addresses as follows: 
32 32-bit (single-precision) floating-point registers, f[0], f[1], ... f[31] 
32 64-bit (double-precision) floating-point registers, f[0], f[2], ... f[62] 
16 128-bit (quad-precision) floating-point registers, f[0], f[4], ... f[60] 
The floating-point registers are arranged so that some of them overlap, that is, are aliased. 
The layout and numbering of the floating-point registers is shown in TABLE 6-2, TABLE 6-3, 
and TABLE 6-4. Unlike the windowed r registers, all of the floating-point registers are 
accessible at any time. The floating-point registers can be read and written by FPop 
(FPop1/FPop2 format) instructions, load/store single/double/quad floating-point 
instructions, and block load and block store instructions. 
































TABLE 6-2 32-bit Floating-Point Registers with Aliasing 

Operand Register              Operand Register 
and Field     From Register   and Field     From Register 
f31 <31:0>    f31<31:0>       f15 <31:0>    f15<31:0> 
f30 <31:0>    f30<31:0>       f14 <31:0>    f14<31:0> 
f29 <31:0>    f29<31:0>       f13 <31:0>    f13<31:0> 
f28 <31:0>    f28<31:0>       f12 <31:0>    f12<31:0> 
f27 <31:0>    f27<31:0>       f11 <31:0>    f11<31:0> 
f26 <31:0>    f26<31:0>       f10 <31:0>    f10<31:0> 
f25 <31:0>    f25<31:0>       f9 <31:0>     f9<31:0> 
f24 <31:0>    f24<31:0>       f8 <31:0>     f8<31:0> 
f23 <31:0>    f23<31:0>       f7 <31:0>     f7<31:0> 
f22 <31:0>    f22<31:0>       f6 <31:0>     f6<31:0> 
f21 <31:0>    f21<31:0>       f5 <31:0>     f5<31:0> 
f20 <31:0>    f20<31:0>       f4 <31:0>     f4<31:0> 
f19 <31:0>    f19<31:0>       f3 <31:0>     f3<31:0> 
f18 <31:0>    f18<31:0>       f2 <31:0>     f2<31:0> 
f17 <31:0>    f17<31:0>       f1 <31:0>     f1<31:0> 
f16 <31:0>    f16<31:0>       f0 <31:0>     f0<31:0> 

TABLE 6-3 64-bit Floating-Point Registers with Aliasing 

Operand Register              Operand Register 
and Field     From Register   and Field     From Register 
f62 <63:0>    f62<63:0>       f30 <63:0>    f30<31:0>:f31<31:0> 
f60 <63:0>    f60<63:0>       f28 <63:0>    f28<31:0>:f29<31:0> 
f58 <63:0>    f58<63:0>       f26 <63:0>    f26<31:0>:f27<31:0> 
f56 <63:0>    f56<63:0>       f24 <63:0>    f24<31:0>:f25<31:0> 
f54 <63:0>    f54<63:0>       f22 <63:0>    f22<31:0>:f23<31:0> 
f52 <63:0>    f52<63:0>       f20 <63:0>    f20<31:0>:f21<31:0> 
f50 <63:0>    f50<63:0>       f18 <63:0>    f18<31:0>:f19<31:0> 
f48 <63:0>    f48<63:0>       f16 <63:0>    f16<31:0>:f17<31:0> 
f46 <63:0>    f46<63:0>       f14 <63:0>    f14<31:0>:f15<31:0> 
f44 <63:0>    f44<63:0>       f12 <63:0>    f12<31:0>:f13<31:0> 
f42 <63:0>    f42<63:0>       f10 <63:0>    f10<31:0>:f11<31:0> 
f40 <63:0>    f40<63:0>       f8 <63:0>     f8<31:0>:f9<31:0> 
f38 <63:0>    f38<63:0>       f6 <63:0>     f6<31:0>:f7<31:0> 
f36 <63:0>    f36<63:0>       f4 <63:0>     f4<31:0>:f5<31:0> 
f34 <63:0>    f34<63:0>       f2 <63:0>     f2<31:0>:f3<31:0> 
f32 <63:0>    f32<63:0>       f0 <63:0>     f0<31:0>:f1<31:0> 

TABLE 6-4 128-bit Floating-Point Registers with Aliasing 

Operand Register 
and Field      From Register 
f60 <127:0>    f60<63:0>:f62<63:0> 
f56 <127:0>    f56<63:0>:f58<63:0> 
f52 <127:0>    f52<63:0>:f54<63:0> 
f48 <127:0>    f48<63:0>:f50<63:0> 
f44 <127:0>    f44<63:0>:f46<63:0> 
f40 <127:0>    f40<63:0>:f42<63:0> 
f36 <127:0>    f36<63:0>:f38<63:0> 
f32 <127:0>    f32<63:0>:f34<63:0> 
f28 <127:0>    f28<31:0>:f29<31:0>:f30<31:0>:f31<31:0> 
f24 <127:0>    f24<31:0>:f25<31:0>:f26<31:0>:f27<31:0> 
f20 <127:0>    f20<31:0>:f21<31:0>:f22<31:0>:f23<31:0> 
f16 <127:0>    f16<31:0>:f17<31:0>:f18<31:0>:f19<31:0> 
f12 <127:0>    f12<31:0>:f13<31:0>:f14<31:0>:f15<31:0> 
f8 <127:0>     f8<31:0>:f9<31:0>:f10<31:0>:f11<31:0> 
f4 <127:0>     f4<31:0>:f5<31:0>:f6<31:0>:f7<31:0> 
f0 <127:0>     f0<31:0>:f1<31:0>:f2<31:0>:f3<31:0> 
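
The aliasing pattern in the three tables follows a simple rule, sketched below. The function name components is hypothetical; it returns the (register number, width) pieces that make up an operand register, most significant piece first.

```python
def components(fn, width):
    """Return the registers aliased by operand register f<fn> of a given width."""
    if width == 32:
        if not 0 <= fn <= 31:
            raise ValueError("32-bit registers are f0 through f31")
        return [(fn, 32)]
    if width == 64:
        if not 0 <= fn <= 62 or fn % 2:
            raise ValueError("64-bit registers are f0, f2, ... f62")
        # The lower half aliases a pair of 32-bit registers.
        return [(fn, 32), (fn + 1, 32)] if fn < 32 else [(fn, 64)]
    if width == 128:
        if not 0 <= fn <= 60 or fn % 4:
            raise ValueError("128-bit registers are f0, f4, ... f60")
        if fn < 32:
            return [(fn + i, 32) for i in range(4)]
        # The upper half aliases a pair of 64-bit registers.
        return [(fn, 64), (fn + 2, 64)]
    raise ValueError("width must be 32, 64, or 128")


print(components(30, 64))
print(components(60, 128))
```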





6.4.1 Floating-Point Register Number Encoding 


The floating-point register number encoding in the instruction field depends on the width of 
the register being addressed. The encoding for the 5-bit instruction field (labeled b<4>-b<0>, 
where b<4> is the most significant bit of the register number) is given in TABLE 6-5. 


TABLE 6-5 Floating-Point Register Number Encoding 

Register Operand                                        Encoding in a 5-bit Register Field 
Type              6-bit Register Number, fn             in an Instruction, rd/rs 
32-bit (single)   0     b<4>  b<3>  b<2>  b<1>  b<0>    b<4>  b<3>  b<2>  b<1>  b<0> 
64-bit (double)   b<5>  b<4>  b<3>  b<2>  b<1>  0       b<4>  b<3>  b<2>  b<1>  b<5> 
128-bit (quad)    b<5>  b<4>  b<3>  b<2>  0     0       b<4>  b<3>  b<2>  0     b<5> 



Compatibility Note — In SPARC-V8, bit 0 of 64- and 128-bit register numbers encoded in
instruction fields was required to be zero. Therefore, all SPARC-V8 floating-point instructions
can run unchanged on an UltraSPARC IIIi processor, using the encoding in TABLE 6-5.
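
The encoding in TABLE 6-5 can be illustrated with a short sketch. The helper names `encode_freg` and `decode_freg` are hypothetical and exist only for this example; the sketch assumes the usual SPARC-V9 rule that bit b<5> of a double or quad register number is stored in bit 0 of the 5-bit field.

```python
def encode_freg(regnum: int, width: int) -> int:
    """Map a 6-bit register number (0-63) to the 5-bit rd/rs field.

    width is the operand width in bits: 32, 64, or 128.
    """
    if width == 32:
        if not 0 <= regnum <= 31:
            raise ValueError("single-precision registers are f0-f31")
        return regnum
    if width not in (64, 128):
        raise ValueError("width must be 32, 64, or 128")
    align = 2 if width == 64 else 4
    if not 0 <= regnum <= 62 or regnum % align != 0:
        raise ValueError("misaligned or out-of-range register number")
    # b<4:1> stay in place; b<5> moves down into field bit 0.
    return (regnum & 0x1E) | (regnum >> 5)


def decode_freg(field: int, width: int) -> int:
    """Recover the 6-bit register number from a 5-bit rd/rs field."""
    if not 0 <= field <= 31:
        raise ValueError("field must fit in 5 bits")
    if width == 32:
        return field
    if width == 64:
        return (field & 0x1E) | ((field & 1) << 5)
    if width == 128:
        if field & 0x2:
            # Field bit 1 set means the quad register number is not 0 mod 4.
            raise ValueError("misaligned quad register")
        return (field & 0x1C) | ((field & 1) << 5)
    raise ValueError("width must be 32, 64, or 128")
```

For example, under these assumptions the double register f32 is encoded as field value 1, and the quad register f60 as field value 29.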
6.4.2 Double and Quad Floating-Point Operands


A 32-bit f register can hold one single-precision operand; a 64-bit (double-precision)
operand requires an aligned pair of f registers, and a 128-bit (quad-precision) operand
requires an aligned quadruple of f registers. At a given time, the floating-point registers can
hold a maximum of 32 single-precision, 16 double-precision, or 8 quad-precision values in
the lower half of the floating-point register file, plus an additional 16 double-precision or
8 quad-precision values in the upper half, or mixtures of the three sizes.


See FIGURE 6-3, TABLE 6-2, TABLE 6-3, and TABLE 6-4 for illustrative formats. 





Programming Note — Data to be loaded into a floating-point double or quad register that
is not doubleword aligned in memory must be loaded into the lower 16 double registers
(8 quad registers) by means of single-precision LDF instructions. If desired, the data can then
be copied into the upper 16 double registers (8 quad registers).


An attempt to execute an instruction that refers to a misaligned floating-point register
operand (that is, a quad-precision operand in a register whose 6-bit register number is not
0 mod 4) shall cause an fp_exception_other trap, with FSR.ftt = 6 (invalid_fp_register).


Given the encoding in TABLE 6-5, it is impossible to specify a double-precision register with 
a misaligned register number. 





Note — The processor does not implement quad-precision operations in hardware. All
floating-point quad (including load and store) operations trap to the OS kernel and are
emulated. Since the processor does not implement quad floating-point arithmetic operations
in hardware, the fp_exception_other trap with FSR.ftt = 6 (invalid_fp_register) does not
occur in this processor.








6.5 Control and Status Register Summary


This section presents a summary of control and status registers. 
[Figure: Integer Unit. General-purpose r registers: NWINDOWS register sets, each with Ins (r[31:24]), Locals (r[23:16]), and Outs (r[15:8]), all 64 bits wide; the register window circulates with SAVE and RESTORE. Global registers r[7:0] exist in Normal, MMU, Interrupt, and Alternate sets, selected by PSTATE.AG, IG, MG.

Floating-point Unit. General-purpose registers hold floating-point numbers and VIS data, and support the block copy function and FSR register access. Register views: WORD (32): f0, f1, ... f31; DOUBLEWORD (32): f0, f2, ... f62; QUADWORD (16): f0, f4, ... f60.

Note: There are no odd-numbered registers above f31. WORDS cannot be loaded into f32 through f62.]

FIGURE 6-3 Integer Unit r Registers and Floating-Point Unit Working Registers
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6.5.1 State and Ancillary State Register Summary 


[Figure: State and ancillary state registers, accessed with RD and WR (to/from the IU working registers).
State registers: Y, CCR, ASI, TICK (non-privileged read OK if TICK.NPT = 0), PC, FPRS.
ASRs: PCR, PIC (non-privileged read OK if PCR.PRIV = 0), DCR, GSR, SET_SOFTINT, CLR_SOFTINT, SOFTINT, TICK_CMP, STICK (non-privileged read OK if STICK.NPT = 0), STICK_CMP.]

FIGURE 6-4 State and Ancillary State Registers


TABLE 6-6 State and Ancillary State Registers

State Register Number (base 10 used) | Access Restriction | R/W | Abbreviation | Description | Reference Section | Notes
0 | None | RW | Y | 32-bit Multiply/Divide (deprecated) | Section 6.6.1 |
1 | — | | — | Reserved | |
2 | None | RW | CCR | Condition Code | Section 6.6.2 |
3 | None | | ASI | Address Space Identifier | Section 6.6.3 |
4 | Depends | | TICK | TICK register for Processor Timer, also accessible as a privileged register | Section 6.7.4 | 1
5 | None | | PC | Program Counter | Section 6.6.5 |
6 | None | | FPRS | Floating-Point Registers State | Section 6.6.6 |
ASR 7-15 | — | | — | Reserved for future use, do not reference by software. | |
ASR 16 | Privileged | | PCR | Performance Instrumentation | Chapter 11 "Performance Instrumentation" | 2
ASR 17 | Depends | | PIC | Performance Instrumentation | Chapter 11 "Performance Instrumentation" | 3
ASR 18 | Privileged | | DCR | Dispatch Control Register | Section 6.7.1 |
ASR 19 | None | | GSR | Graphics (VIS) Status Register | Section 6.7.2 |
ASR 20 | Privileged | | SET_SOFTINT | Software Interrupts | Section 6.7.3 |
ASR 21 | Privileged | | CLR_SOFTINT | Software Interrupts | Section 6.7.3 |
ASR 22 | Privileged | | SOFTINT_REG | Software Interrupts | Section 6.7.3 |
ASR 23 | Privileged | | TICK_CMP | Processor and System Timer Registers | Section 6.7.4 |
ASR 24 | Depends | | STICK | Processor and System Timer Registers | Section 6.7.4 | 4
ASR 25 | Privileged | | STICK_CMP | Processor and System Timer Registers | Section 6.7.4 |
ASR 26-31 | — | | — | Reserved for future use, do not reference by software. | |

1. Writes are always privileged; reads are privileged if TICK.NPT = 1. Otherwise, reads are non-privileged.
2. If PCR.NC = 0, access is always privileged. If PCR.NC ≠ 0 and PCR.PRIV = 0, access is non-privileged; otherwise, access is privileged.
3. All accesses are privileged if PCR.PRIV = 1; otherwise, all accesses are non-privileged.
4. Writes are always privileged; reads are privileged if STICK.NPT = 1. Otherwise, reads are non-privileged.
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6.5.2 Privileged Register Summary 


[Figure: Privileged registers, accessed with RDPR and WRPR (to/from the IU working registers).
Trap state registers TPC, TNPC, TSTATE, and TT (RW), one set per trap level, indexed by TL.
Other registers: TICK, TBA, PSTATE, TL, PIL, CWP, CANSAVE, CANRESTORE, CLEANWIN, OTHERWIN, and WSTATE (RW); VER.]

FIGURE 6-5 Privileged Registers
TABLE 6-7 Privileged Registers

Privileged Register Number (base 10 used) | Access Restriction | R/W | Abbreviation | Description | Reference Section | Notes
0 | Privileged | RW | TPC | Trap state program counter | Section 6.8.1 |
1 | Privileged | RW | TNPC | Trap state next program counter | Section 6.8.1 |
2 | Privileged | RW | TSTATE | Trap state register | Section 6.8.1 |
3 | Privileged | RW | TT | Trap type register | Section 6.8.1 |
4 | Privileged | RW | TICK | Processor TICK timer register, also accessible as a state register | Section 6.7.4 |
5 | Privileged | RW | TBA | Trap base address register | Section 6.8.2 |
6 | Privileged | RW | PSTATE | Processor state register | Section 6.8.3 |
7 | Privileged | RW | TL | Trap level register | Section 6.8.4 |
8 | Privileged | RW | PIL | Processor Interrupt Level register | Section 6.8.5 |
9 | Privileged | RW | CWP | Current window pointer | Section 6.8.6 |
10 | Privileged | RW | CANSAVE | Savable register sets | Section 6.8.6 |
11 | Privileged | RW | CANRESTORE | Restorable register sets | Section 6.8.6 |
12 | Privileged | RW | CLEANWIN | Clean register sets | Section 6.8.6 |
13 | Privileged | RW | OTHERWIN | Other register sets susceptible to spill/fill | Section 6.8.6 |
14 | Privileged | RW | WSTATE | Window state register for traps due to spills and fills | Section 6.8.7 |
15-30 | Privileged | | — | Reserved | |
31 | Privileged | R | VER | Processor version register | Section 6.8.8 |
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6.5.3 ASI and Specially Accessed Register Summary 


[Figure: Status registers (ASI mapped): DCUCR, VA Watchpoint, PA Watchpoint. Special access registers: FSR, stored with STFSR and STXFSR and loaded with LDFSR and LDXFSR.]

FIGURE 6-6 ASI and Specially Accessed Registers

TABLE 6-8 ASI and Specially Accessed Registers

Type | Abbreviation | Description | Reference Section
ASI | DCUCR | Data Cache Unit Control Register | Section 6.10.1
ASI 0x58 | PA_WATCHPOINT | Watchpoint for physical addresses | Section 6.10.2
ASI 0x58 | VA_WATCHPOINT | Watchpoint for virtual addresses | Section 6.10.2
LD/ST floating-point opcode | Load/Store FSR | Access the Floating-point Status Register |
6.6 State Registers


The state registers provide control and status to the Integer Execution Unit. 


The type and accessibility of the registers (privileged vs. non-privileged mode) are 
summarized in FIGURE 6-4. 


The SPARC-V9 architecture provides for state register numbers 0 through 31. Of these, 25 are
classified as ancillary state registers (ASRs), numbered from 7 through 31. State registers
0 through 6 are defined by SPARC-V9.


6.6.1 32-bit Multiply/Divide (Y) State Register 0

The Y register is deprecated; it is provided only for compatibility with previous versions of
the architecture. It should not be used in new SPARC-V9 software. It is recommended that
all instructions that reference the Y register (that is, SMUL, SMULcc, UMUL, UMULcc,
MULScc, SDIV, SDIVcc, UDIV, UDIVcc, RDY, and WRY) be avoided.

The low-order 32 bits of the Y register, illustrated in FIGURE 6-7, contain the more significant
word of the 64-bit product of an integer multiplication, as a result of either a 32-bit integer
multiply (SMUL, SMULcc, UMUL, UMULcc) instruction or an integer multiply step
(MULScc) instruction. The Y register also holds the more significant word of the 64-bit
dividend for a 32-bit integer divide (SDIV, SDIVcc, UDIV, UDIVcc) instruction.

[Figure: bits 63:32, reserved; bits 31:0, the more significant word of the product or dividend.]

FIGURE 6-7 Y Register

Although Y is a 64-bit register, its high-order 32 bits are reserved and always read as zero.
The Y register is read and written with the RDY and WRY instructions, respectively.

6.6.2 Integer Unit Condition Codes State Register 2 (CCR)


The Condition Codes Register (CCR), shown in FIGURE 6-8, holds the integer condition 
codes. 


The CCR is accessible using Read and Write State Register instructions (RDCCR and WRCCR) 
in non-privileged or privileged mode. 
[Figure: CCR bits 7:4, xcc; bits 3:0, icc.]

FIGURE 6-8 Condition Codes Register

6.6.2.1 CCR Condition Code Fields (xcc and icc)

All instructions that set integer condition codes set both the xcc and icc fields. The xcc
condition codes indicate the result of an operation when viewed as a 64-bit operation. The
icc condition codes indicate the result of an operation when viewed as a 32-bit operation.
For example, if an operation results in the 64-bit value 0x0000 0000 FFFF FFFF, the 32-bit
result is negative (icc.N is set to one) but the 64-bit result is nonnegative (xcc.N is set to
zero).

Each of the 4-bit condition code fields is composed of four 1-bit subfields, as shown in
FIGURE 6-9.

Field | n | z | v | c | Interpretation
xcc | bit 7 | bit 6 | bit 5 | bit 4 | 64-bit
icc | bit 3 | bit 2 | bit 1 | bit 0 | 32-bit

FIGURE 6-9 Integer Condition Codes (CCR_icc and CCR_xcc)


The n bits indicate whether the two's-complement ALU result was negative for the last
instruction that modified the integer condition codes; 1 = negative, 0 = nonnegative.

The z bits indicate whether the ALU result was zero for the last instruction that modified the
integer condition codes; 1 = zero, 0 = nonzero.

The v bits indicate whether the ALU result was outside the range of (was not representable in)
64-bit (xcc) or 32-bit (icc) two's-complement notation for the last instruction that modified
the integer condition codes; 1 = overflow, 0 = no overflow.

The c bits indicate whether a two's-complement carry (or borrow) occurred during the last
instruction that modified the integer condition codes. Carry is set on addition if there is a
carry out of bit 63 (xcc) or bit 31 (icc). Carry is set on subtraction if there is a borrow into
bit 63 (xcc) or bit 31 (icc); 1 = carry, 0 = no carry.
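
As a rough illustration of how the two fields relate, the following sketch models the CCR value produced by a 64-bit addition that sets the condition codes. It is a software model under stated assumptions, not a description of the hardware; the function names `_flags` and `addcc_ccr` are hypothetical, and the bit layout (n, z, v, c from high to low within each 4-bit field) is assumed from the SPARC-V9 definition.

```python
def _flags(a: int, b: int, bits: int) -> int:
    """Return the 4-bit n,z,v,c field for a + b viewed as a `bits`-wide add."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    total = a + b
    result = total & mask
    n = 1 if result & sign else 0
    z = 1 if result == 0 else 0
    # Overflow: operands share a sign that differs from the result's sign.
    v = 1 if (~(a ^ b)) & (a ^ result) & sign else 0
    # Carry out of the most significant bit of this view.
    c = 1 if total > mask else 0
    return (n << 3) | (z << 2) | (v << 1) | c


def addcc_ccr(a: int, b: int) -> int:
    """Model the 8-bit CCR after a 64-bit add: xcc in bits 7:4, icc in bits 3:0."""
    return (_flags(a, b, 64) << 4) | _flags(a, b, 32)
```

With this model, adding 0 to 0xFFFF FFFF yields a result that is negative in the 32-bit view only, so only icc.n is set, which matches the example in the text.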
Condition Codes

These bits are modified by the arithmetic and logical instructions, the names of which end
with the letters "cc" (for example, ANDcc) and by the WRCCR instruction. They can be
modified by a DONE or RETRY instruction, which replaces these bits with the CCR field of
the TSTATE register. The BPcc and Tcc instructions may cause a transfer of control based
on the values of these bits. The MOVcc instruction can conditionally move the contents of an
integer register based on the state of these bits. The FMOVcc instruction can conditionally
move the contents of a floating-point register according to the state of these bits.

CCR_extended_integer_cond_codes (xcc)

Bits 7 through 4 are the IU condition codes, which indicate the results of an integer
operation, with both of the operands and the result considered to be 64 bits wide.

CCR_integer_cond_codes (icc)

Bits 3 through 0 are the IU condition codes, which indicate the results of an integer
operation, with both of the operands and the result considered to be 32 bits wide. In addition
to the BPcc and Tcc instructions, the Bicc instruction may also cause a transfer of control
based on the values of these bits.

6.6.3 Address Space Identifier (ASI) Register ASR 3


The ASI Register, shown in FIGURE 6-10, specifies the ASI to be used for load and store
alternate instructions that use the "rs1 + simm13" addressing form.

Non-privileged (user-mode) software may write any value into the ASI register; however,
values with bit 7 = 0 select restricted ASIs. When a non-privileged instruction makes an
access that uses an ASI with bit 7 = 0, a privileged_action exception is generated.

[Figure: bits 7:0, ASI.]

FIGURE 6-10 Address Space Identifier Register
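
The restricted-ASI rule above amounts to a test of bit 7. A minimal sketch follows; the function names are hypothetical, and the model covers only the rule stated in this section (it ignores any other reason an access might trap).

```python
def is_restricted_asi(asi: int) -> bool:
    """An 8-bit ASI value with bit 7 = 0 selects a restricted ASI."""
    if not 0 <= asi <= 0xFF:
        raise ValueError("ASI is an 8-bit value")
    return (asi & 0x80) == 0


def causes_privileged_action(asi: int, privileged: bool) -> bool:
    """True if an alternate-space access with this ASI raises privileged_action."""
    return is_restricted_asi(asi) and not privileged
```

Note that writing a restricted value into the ASI register is itself permitted in non-privileged mode; under this model the exception arises only when the value is used for an access.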
6.6.4 TICK Register (TICK) ASR 4

See Section 6.7.4 "Timer State Registers: ASRs 4, 23, 24, 25" for more details.

6.6.5 Program Counters State Register 5


The program counter (PC) contains the address of the instruction currently being executed. 
The next program counter (nPC) holds the address of the next instruction to be executed if a 
trap does not occur. The low-order two bits of PC and nPC always contain zero. 


For a delayed control transfer, the instruction that immediately follows the transfer 
instruction is known as the delay instruction. This delay instruction is executed (unless the 
control transfer instruction annuls it) before control is transferred to the target. During 
execution of the delay instruction, the nPC points to the target of the control transfer 
instruction, and the PC points to the delay instruction. See Chapter 7 “Instruction Types” for 
more details. 


The PC is used implicitly as a destination register by CALL, Bicc, BPcc, BPr, FBfcc, 
FBPfcc, JMPL, and RETURN instructions. It can be read directly by a RDPC instruction. 





6.6.6 Floating-Point Registers State (FPRS) Register 6

The Floating-Point Registers State (FPRS) Register, shown in FIGURE 6-11, holds control
information for the floating-point register file. Mode and status information about the
Floating-point Unit is presented in Section 6.9.1 "Floating-Point Status Register (FSR)".

This register is readable and writable using the read and write state register instructions
RDFPRS and WRFPRS when the processor is in non-privileged or privileged mode.

[Figure: FEF (bit 2) | DU (bit 1) | DL (bit 0)]

FIGURE 6-11 Floating-Point Registers State Register
6.6.6.1 FPRS_enable_fp (FEF)

Bit 2, FEF, determines whether the FPU is enabled. Both FPRS.FEF and PSTATE.PEF must
be set to enable floating-point operations. If either bit is not set, executing a floating-point
instruction causes an fp_disabled trap.





























6.6.6.2 FPRS_dirty_upper (DU)

Bit 1 is the "dirty" bit for the upper half of the floating-point registers; that is, f32-f62. It
is set whenever any of the upper floating-point registers is modified. The processor may set
the bit whenever a floating-point instruction is issued, even though that instruction never
completes and no output register is modified. The dirty bit may be set by instructions that the
processor executes but does not complete due to wrong branch prediction. The DU bit is
cleared only by software.

6.6.6.3 FPRS_dirty_lower (DL)

Bit 0 is the "dirty" bit for the lower 32 floating-point registers; that is, f0-f31. It is set
whenever any of the lower floating-point registers is modified. The processor may set the bit
whenever a floating-point instruction is issued, even though that instruction never completes
and no output register is modified. The DL bit is cleared only by software.
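
The three FPRS bits can be summarized in a small model. The names below are hypothetical. The `halves_to_save` helper reflects one plausible use of the dirty bits (letting system software skip saving an unmodified half of the register file on a context switch); that use is an assumption for illustration and is not stated in this section.

```python
FPRS_DL = 1 << 0   # dirty lower: f0-f31 possibly modified
FPRS_DU = 1 << 1   # dirty upper: f32-f62 possibly modified
FPRS_FEF = 1 << 2  # FPU enable


def fp_enabled(fprs: int, pstate_pef: int) -> bool:
    """Floating-point operations need both FPRS.FEF and PSTATE.PEF set."""
    return bool(fprs & FPRS_FEF) and bool(pstate_pef)


def halves_to_save(fprs: int) -> list:
    """Assumed use: list the register-file halves whose dirty bit is set."""
    halves = []
    if fprs & FPRS_DL:
        halves.append("f0-f31")
    if fprs & FPRS_DU:
        halves.append("f32-f62")
    return halves
```

Because the processor may set a dirty bit even when no register was actually modified, a set bit means "possibly modified"; only a clear bit is a firm guarantee.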





6.7 Ancillary State Registers: ASRs 16-25


The SPARC-V9 architecture provides for optional ancillary state registers (ASRs) in addition 
to the six state registers defined for all SPARC-V9 processors and already described. 


An ASR is read and written with the RDASR and WRASR instructions, respectively. Access to 
a particular ASR may be privileged or non-privileged. A RDASR or WRASR instruction is 
privileged if the accessed register is privileged. 


All the state and ancillary state registers are summarized in TABLE 6-6. Some of the register
descriptions are presented below.


Note — PCR (ASR 16) and PIC (ASR 17) are discussed in detail in Chapter 11 
“Performance Instrumentation.” 
6.7.1 Dispatch Control Register (DCR) ASR 18


The DCR provides control over the dispatch unit and branch prediction logic. The DCR also 
provides factory test equipment with access to internal logic states using the OBSDATA bus 
interface. 


The DCR is a read/write register. Unused bits are read as zero and should be written only 
with zero or values previously read from them. The DCR is a privileged register; attempted 
access by non-privileged (user) code causes a privileged_opcode trap. POR value is 
XXXX.XX0X>. 


The DCR is illustrated in FIGURE 6-12 and described in TABLE 6-9. 


[Figure: fields from most to least significant: Reserved | DPE | OBS | BPE (bit 5) | RPE (bit 4) | SI (bit 3) | IPE (bit 2) | IFPOE (bit 1) | MS (bit 0)]

FIGURE 6-12 Dispatch Control Register (ASR 0x12)


TABLE 6-9 DCR Bit Description

Bit | Field | Description
 | — | Reserved.
 | DPE | Data Cache Parity Error Enable. If cleared, no parity checking at the Data Cache SRAM arrays (Data, Physical Tag, and Snoop Tag arrays) will be done. It also implies no dcache_parity_error trap (TT 0x071) will ever be generated. However, parity bits are still generated and written to the D-cache Parity SRAM. Therefore, when DPE is set, valid D-cache lines will automatically have correct parity bits.
 | OBSDATA | These bits are used to select the set of signals driven on the OBSDATA<9:0> pins of the processor for factory test purposes.

Branch and Return Control

5 | BPE | Branch Prediction Enable. When BPE = 1, conditional branches are predicted through internal hardware. When BPE = 0, all branches are predicted not taken. After Power-On Reset initialization, this bit is set to zero. This bit is also automatically set to zero on any trap causing RED_state entry (but not cleared when privileged code enters RED_state by setting the RED bit in PSTATE).
4 | RPE | Return Address Prediction Enable. When RPE = 0, the return address prediction stack is disabled. Even when encountering a JMPL instruction, instruction fetch will continue on a sequential path until the return address is generated and a mispredict is signalled. When RPE = 1, the processor may attempt to predict the target address of JMPL instructions and prefetch subsequent instructions accordingly. After Power-On Reset initialization, this bit is set to zero. This bit is also automatically set to zero on any trap causing RED_state entry (but left unchanged when privileged code enters RED_state by setting PSTATE.RED).

Instruction Dispatch Control

3 | SI | Single Issue Disable. When SI = 0, only one instruction will be outstanding at a time. Superscalar instruction dispatch is disabled, and only one instruction is executed at a time. When SI = 1, normal pipelining is enabled. The processor can issue new instructions prior to the completion of previously issued instructions. After Power-On Reset initialization, this bit is set to zero. This bit is also automatically set to zero on any trap causing RED_state entry (but left unchanged when privileged code enters RED_state by setting PSTATE.RED).
2 | IPE | Instruction Cache Parity Error Enable. If cleared, no parity checking at the Instruction Cache SRAM arrays (Data, Physical Tag, and Snoop Tag arrays) will be done. It also implies no icache_parity_error trap (TT 0x072) will ever be generated. However, parity bits are still generated and written to the I-cache Parity SRAM. Therefore, when IPE is set, valid I-cache lines will automatically have correct parity bits.
1 | IFPOE | Interrupt Floating-Point Operation Enable. The IFPOE bit enables system software to take interrupts on floating-point instructions. When set, the processor forces an fp_disabled trap when an interrupt occurs on floating-point code.
0 | MS | Multiscalar dispatch enable. When MS = 0, the processor operates in scalar mode, issuing and executing one instruction at a time. Pipelined operation is still controlled by the SI bit. MS = 1 enables superscalar (normal) instruction issue. After Power-On Reset initialization, this bit is set to zero. The bit is also automatically set to zero on any trap causing RED_state entry (but left unchanged when privileged code enters RED_state by setting PSTATE.RED).














Interrupt Floating-Point Operation Enable (Bit 1)

The IFPOE bit enables system software to take interrupts on floating-point instructions. This
enable bit is cleared by hardware at power-on. System software must set the bit as needed.
When this bit is set, the UltraSPARC IIIi processor forces an fp_disabled trap when an
interrupt occurs on FP-only code. The trap handler is then responsible for checking whether
the floating-point unit is indeed disabled. If it is not, the trap handler then enables interrupts
to take the pending interrupt.

Note — This behavior deviates from SPARC-V9 trap priorities in that interrupts are of lower
priority than the other two types of floating-point exceptions (fp_exception_ieee_754,
fp_exception_other).

This mechanism is triggered for a floating-point instruction only if none of the
approximately twelve preceding instructions across the two integer, load/store, and branch
pipelines are valid, under the assumption that they are better suited to take the interrupt
(only one trap entry/exit).

Upon entry, the handler must check both the TSTATE.PEF and FPRS.FEF bits. If
TSTATE.PEF = 1 and FPRS.FEF = 1, the handler has been entered because of an
interrupt, either interrupt vector or interrupt level. In such a case:

1. The fp_disabled handler should enable interrupts (that is, set PSTATE.IE = 1), then
issue an integer instruction (for example, add %g0, %g0, %g0). An interrupt is
triggered on this instruction.
2. The processor then enters the appropriate interrupt handler (PSTATE.IE is turned off
here) for the type of interrupt.
3. At the end of the interrupt handler, the interrupted instruction is retried after returning
from the interrupt; that is, the add %g0, %g0, %g0 is retried.
4. The fp_disabled handler then returns to the original process with a RETRY.
5. The "interrupted" FPop is then retried (taking an fp_exception_ieee_754 or
fp_exception_other trap at this time if needed).

6.7.2 Graphics Status Register (GSR) ASR 19


The GSR is used with the VIS Instruction Set. 


The GSR is accessible in non-privileged mode. It can be read and written using the RDASR 
and WRASR state register instructions. 


TABLE 6-10 GSR Opcodes

Opcode | op3 | Register Field | Operation
RDASR | 101000 | rs1 == 0x13 | Read GSR
WRASR | 110000 | rd == 0x13 | Write GSR

[Figure: 10 (bits 31:30) | rd (29:25) | op3 (24:19) | rs1 (18:14) | i = 0 (13) | — (12:0)]

FIGURE 6-13 RDASR format

[Figure, register form: 10 (bits 31:30) | rd (29:25) | op3 (24:19) | rs1 (18:14) | i = 0 (13) | — (12:5) | rs2 (4:0)
Immediate form: 10 (bits 31:30) | rd (29:25) | op3 (24:19) | rs1 (18:14) | i = 1 (13) | simm13 (12:0)]

FIGURE 6-14 WRASR format

Suggested Assembly Language Syntax

Accesses to this register cause an fp_disabled trap if either PSTATE.PEF or FPRS.FEF is zero.

The format of the GSR is:

[Figure: MASK (bits 63:32) | — (31:28) | IM (27) | IRND (26:25) | GFX_STALL (24) | — (23:8) | SCALE (7:3) | ALIGN (2:0)]

FIGURE 6-15 GSR Format (ASR 0x13)
TABLE 6-11 GSR Bit Description

Bit | Field | Description
63:32 | MASK<31:0> | This field specifies the mask used by the BSHUFFLE instruction. The field contents are set by the BMASK instruction.
31:28 | — | Reserved.
27 | IM | Interval Mode: When IM = 1, the values in FSR.RD and FSR.NS are ignored; the processor operates as if FSR.NS = 0 and rounds floating-point results according to GSR.IRND.
26:25 | IRND<1:0> | IEEE Std 754-1985 rounding direction to use in Interval Mode (GSR.IM = 1), as follows: 0 = round toward nearest (even if tie); 1 = round toward 0; 2 = round toward +∞; 3 = round toward −∞. When GSR.IM = 1, the value in GSR.IRND overrides the value in FSR.RD.
24 | GFX_STALL | Read-only. This field reflects the flow control signal from the graphics devices that indicates the status of their input command queues. User software can read it without having a load go to the bus. This helps keep a sustained pipeline of stores from the processor to the graphics devices, because software does not have to wait for stores to drain in order for the load to complete. The field has inverted polarity compared to the external pin (that is, 0 = stall, 1 = do not stall).
23:8 | — | Reserved.
7:3 | SCALE<4:0> | Shift count in the range 0-31, used by the PACK instructions for formatting.
2:0 | ALIGN<2:0> | Least three significant bits of the address computed by the last executed ALIGNADDRESS or ALIGNADDRESS_LITTLE instruction.
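
To make the field layout concrete, the sketch below unpacks a 64-bit GSR value into its fields and selects the effective rounding direction. The function names are hypothetical, and the position of GFX_STALL at bit 24 is an assumption read off the bit boundaries of FIGURE 6-15.

```python
def decode_gsr(gsr: int) -> dict:
    """Split a 64-bit GSR value into its named fields."""
    return {
        "MASK": (gsr >> 32) & 0xFFFFFFFF,   # bits 63:32
        "IM": (gsr >> 27) & 0x1,            # bit 27
        "IRND": (gsr >> 25) & 0x3,          # bits 26:25
        "GFX_STALL": (gsr >> 24) & 0x1,     # bit 24 (assumed position)
        "SCALE": (gsr >> 3) & 0x1F,         # bits 7:3
        "ALIGN": gsr & 0x7,                 # bits 2:0
    }


def effective_rounding(gsr: int, fsr_rd: int) -> int:
    """GSR.IRND overrides FSR.RD only while Interval Mode (IM = 1) is on."""
    fields = decode_gsr(gsr)
    return fields["IRND"] if fields["IM"] else fsr_rd
```

A caller would pass the raw value read from the GSR together with the current FSR.RD field.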











6.7.3 Software Interrupt State Registers: ASRs 20, 21, and 22


Three registers are used to control software interrupts: SOFTINT, SET_SOFTINT, and
CLR_SOFTINT. Bits set in the SOFTINT register cause interrupt traps at the corresponding
levels when those interrupts are enabled. The SOFTINT register can be written to directly
using ASR 22, or indirectly using the SET_SOFTINT and CLR_SOFTINT registers as
described in this section.

All three registers are accessible only in privileged mode. The SOFTINT register is accessed
using the RD and WR state register access instructions. The SET_SOFTINT and
CLR_SOFTINT registers are written using the WR state register access instruction. See
TABLE 6-12 and FIGURE 6-16 for more details.

TABLE 6-12 Register-window State Registers

Soft Interrupt Register | ASR # | Name and Description | Privileged Access Instructions
SOFTINT | 22 | Software Interrupt Register | RDSOFTINT, WRSOFTINT
SET_SOFTINT | 20 | Sets Software Interrupt register bits. | WRSOFTINT_SET
CLR_SOFTINT | 21 | Clears Software Interrupt register bits. | WRSOFTINT_CLR

[Figure: SOFTINT: — (bits 63:17) | SM (16) | INT_LEVEL (15:1) | TM (0)
SET_SOFTINT: bits 63:17 read zero, writes ignored; bits 16:0 set bits in SOFTINT.
CLR_SOFTINT: bits 63:17 read zero, writes ignored; bits 16:0 clear bits in SOFTINT.]

FIGURE 6-16 SOFTINT, SET_SOFTINT, and CLR_SOFTINT Register Formats


SOFTINT Register

The operating system uses the SOFTINT register to schedule interrupts. The field definitions
are described in TABLE 6-13.

TABLE 6-13 SOFTINT Bit Descriptions

Bit | Field | Description
16 | SM (STICK_INT) | When the STICK_COMPARE.INT_DIS bit is zero (system tick compare is enabled) and its STICK_CMPR field matches the value in the STICK register, then the SM field in SOFTINT is set to one and a Level-14 interrupt is generated. See Section 6.7.4 "Timer State Registers: ASRs 4, 23, 24, 25" for details.
15:1 | INT_LEVEL | When a bit is set within this field (bits 15:1), an interrupt is caused at the corresponding interrupt level. Note that INT_LEVEL<15> is shared by the Level-15 interrupt and the PIC overflow interrupt.
0 | TM (TICK_INT) | When the TICK_COMPARE.INT_DIS bit is zero (that is, tick compare is enabled) and its TICK_CMPR field matches the value in the TICK register, then the TM field in the SOFTINT register is set to one and a Level-14 interrupt is generated. See "TICK_COMPARE Register" for details.
SET_SOFTINT Register

The SET_SOFTINT register is written to set bits in the SOFTINT register. When a bit in the
SET_SOFTINT register is set to a one, the corresponding bit in the SOFTINT register is set.

CLR_SOFTINT Register

The CLR_SOFTINT register is written in privileged mode using the WR write state register
instruction to clear bits in the SOFTINT register. When a bit in the CLR_SOFTINT register
is set to a one, the corresponding bit in the SOFTINT register is cleared.
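
The set and clear semantics can be modeled as simple bitwise operations on a 17-bit value. The class and method names below are hypothetical and exist only for illustration; the model ignores privilege checks and the side effects of actually taking an interrupt.

```python
class SoftIntModel:
    """Toy model of SOFTINT with its SET_SOFTINT / CLR_SOFTINT write ports."""

    VALID = (1 << 17) - 1  # bits 16:0; writes to bits 63:17 are ignored

    def __init__(self) -> None:
        self.softint = 0

    def write_set_softint(self, value: int) -> None:
        # Each one bit written sets the corresponding SOFTINT bit.
        self.softint |= value & self.VALID

    def write_clr_softint(self, value: int) -> None:
        # Each one bit written clears the corresponding SOFTINT bit.
        self.softint &= ~value & self.VALID

    def pending_levels(self) -> list:
        """Interrupt levels requested: INT_LEVEL bits map to their own level,
        while TM (bit 0) and SM (bit 16) both request Level 14."""
        levels = {n for n in range(1, 16) if self.softint & (1 << n)}
        if self.softint & ((1 << 0) | (1 << 16)):
            levels.add(14)
        return sorted(levels)
```

Using separate set and clear ports means software can change individual bits without a read-modify-write sequence on SOFTINT itself.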


6.7.4 Timer State Registers: ASRs 4, 23, 24, 25 


The processor has two timers. The TICK timer is driven by the processor clock. The STICK
timer is driven by the system clock. Four registers are used to implement the timers and
support the timer interrupts. Timer state registers are described in TABLE 6-14.

TABLE 6-14 Timer State Registers

Register | ASR # (base 10) | Name and Description | Access Instructions
TICK | 4 | TICK register | Depends
TICK_COMPARE | 23 | TICK Compare register | State Register Instructions in privileged mode
STICK | 24 | STICK register | Depends
STICK_COMPARE | 25 | STICK Compare register | State Register Instructions in privileged mode

[Figure: TICK: NPT (bit 63) | COUNTER (62:0)
TICK_COMPARE: INT_DIS (bit 63) | TICK_CMPR (62:0)
STICK: NPT (bit 63) | COUNTER (62:0)
STICK_COMPARE: INT_DIS (bit 63) | STICK_CMPR (62:0)]

FIGURE 6-17 Timer State Registers
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TICK Register 


The TICK register is a 63-bit counter that counts processor clock cycles. 


In privileged mode, the TICK register is always readable using either the RDPR (privileged 
read) or RDTICK (state register read) instructions. The TICK register is always write-able in 
privileged mode using the WRPR (privileged write) instruction; there is no WRTICK (state 
register write) instruction. 


The TICK.NPT bit (bit 63) controls whether the register is readable in non-privileged mode.
If TICK.NPT = 0, then the TICK register is readable in non-privileged mode using the RDTICK
state register read instruction. When TICK.NPT = 1, an attempt by software to read the TICK
register in non-privileged mode causes a privileged_action exception. Software operating in
non-privileged mode can never write to the TICK register.
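

As a minimal sketch of the read rule above, the following C function models the decision for a given privilege mode. The outcome enum and the function name are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPT_BIT (UINT64_C(1) << 63) /* TICK.NPT, bit 63 */

/* Hypothetical outcome codes for this sketch. */
typedef enum { READ_OK, READ_PRIVILEGED_ACTION } tick_read_outcome;

/* Decides whether a read of TICK succeeds for the given privilege mode. */
static tick_read_outcome tick_read_check(uint64_t tick, bool privileged) {
    if (privileged) {
        return READ_OK; /* privileged reads always succeed */
    }
    return (tick & NPT_BIT) ? READ_PRIVILEGED_ACTION : READ_OK;
}
```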





The TICK.NPT bit is set to one by a Power-On Reset trap. The value of TICK.COUNTER is
reset after a Power-On Reset trap.


After the TICK register is written, reading the TICK register returns a value incremented (by 
one or more) from the last value written, rather than from some previous value of the counter. 
The number of counts between a write and a subsequent read does not accurately reflect the 
number of processor cycles between the write and the read. Software may rely only on 
read-to-read counts of the TICK register for accurate timing, not on write-to-read counts. 
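

A read-to-read measurement can be sketched as follows. The function takes two raw 64-bit values as they would be returned by two reads of TICK; how those values are obtained is outside this sketch, and the function name is hypothetical. Masking off bit 63 (NPT) and subtracting modulo 2^63 is an assumption about how software would typically handle counter wraparound.

```c
#include <assert.h>
#include <stdint.h>

#define TICK_COUNTER_MASK ((UINT64_C(1) << 63) - 1) /* COUNTER, bits 62:0 */

/* Elapsed counts between two raw TICK reads. The NPT bit is ignored and the
 * difference is taken modulo 2^63 so that one counter wrap is tolerated. */
static uint64_t tick_elapsed(uint64_t first_read, uint64_t second_read) {
    uint64_t a = first_read & TICK_COUNTER_MASK;
    uint64_t b = second_read & TICK_COUNTER_MASK;
    return (b - a) & TICK_COUNTER_MASK;
}
```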


Note — The TICK register is unaffected by any reset other than a Power-On Reset. 





Programming Note — TICK.NPT may be used by a secure operating system to control 
access by user software to high-accuracy timing information. The operation of the timer 
might be emulated by the trap handler, which could read TICK.COUNTER and change the
value to lower its accuracy.





TICK_COMPARE Register


The TICK_COMPARE register causes the processor to generate a trap when the TICK
register reaches the value in the TICK_COMPARE register and the INT_DIS bit is zero. If
the INT_DIS bit is one, then no interrupt is generated.


When the TICK_CMPR field exactly matches the TICK.COUNTER field and INT_DIS = 0,
a TICK_INT is posted in the SOFTINT register. This has the effect of posting a
Level-14 interrupt to the processor when the processor has a PIL register value less than
fourteen and PSTATE.IE = 1.
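

The two conditions above (a compare match that posts TICK_INT, and the PIL and PSTATE.IE conditions under which the Level-14 interrupt is taken) can be sketched in C. Both function names are hypothetical, and the code is a software model rather than a description of the hardware implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT63        (UINT64_C(1) << 63)       /* NPT in TICK, INT_DIS in TICK_COMPARE */
#define COUNTER_MASK ((UINT64_C(1) << 63) - 1) /* bits 62:0 of either register */

/* True when TICK_INT would be posted: the 63-bit fields match and INT_DIS = 0. */
static bool tick_match_posts_int(uint64_t tick, uint64_t tick_compare) {
    bool int_dis = (tick_compare & BIT63) != 0;
    return !int_dis && ((tick & COUNTER_MASK) == (tick_compare & COUNTER_MASK));
}

/* True when a posted Level-14 interrupt can be taken: PIL < 14 and PSTATE.IE = 1. */
static bool level14_deliverable(unsigned pil, bool pstate_ie) {
    assert(pil <= 15); /* PIL is a 4-bit field */
    return pstate_ie && pil < 14;
}
```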
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Programming Note — The Level-14 interrupt handler must check the SOFTINT<14>, TM 
(TICK_INT), and SM (STICK_INT) fields of the SOFTINT register to determine the
source or sources of the Level-14 interrupt.





In privileged mode, the TICK_COMPARE register is always accessible using the state register
read and write instructions. The TICK_COMPARE register is not accessible in non-privileged
mode. Non-privileged accesses to this register cause a privileged_opcode trap.





STICK Register 


The STICK register is a 63-bit counter that increments at a rate determined by the system
clock.


The STICK register is always accessible in privileged mode using the RDSTICK and 
WRSTICK state register instructions. 





The STICK.NPT bit (bit 63) controls whether the register is readable in non-privileged mode.
If STICK.NPT = 0, then the STICK register is readable in non-privileged mode using the
RDSTICK state register read instruction. When STICK.NPT = 1, an attempt by software to
read the STICK register in non-privileged mode causes a privileged_action exception.
Software operating in non-privileged mode can never write to the STICK register.





The STICK.NPT bit is set to one by a Power-On Reset trap. The value of 
STICK.COUNTER is cleared after a Power-On Reset trap. 





After the STICK register is written, reading the STICK register returns a value incremented 
(by one or more) from the last value written, rather than from some previous value of the 
counter. 





Note — The STICK register is unaffected by any reset other than a Power-On Reset. 


STICK_COMPARE Register


The STICK_COMPARE register causes the processor to generate a trap when the STICK
register reaches the value in the STICK_COMPARE register and the INT_DIS bit is zero. If
the INT_DIS bit is one, then no interrupt is generated.


The STICK_COMPARE register is accessible only in privileged mode. Accesses to this register in
non-privileged mode cause a privileged_opcode trap.
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When STICK_CMPR field exactly matches STICK. COUNTER field and INT DIS = 0, then 
a STICK_INT is posted in the SOFTINT register. This has the effect of posting a Level-14
interrupt to the processor when the processor has a PIL register value less than fourteen and
PSTATE.IE = 1.














Programming Note — The Level-14 interrupt handler must check SOFTINT<14>, 
TICK_INT, and STICK_INT to determine the source of the Level-14 interrupt.





After a Power-On Reset trap, the INT_DIS bit is set to one (disabling system tick compare
interrupts), and the STICK_CMPR value is set to zero.





6.8 Privileged Registers


The privileged registers are described in this section. The privileged registers are visible only
to software running in privileged mode (PSTATE.PRIV = 1). Privileged registers are written
with the WRPR instruction and read with the RDPR instruction.


Refer to FIGURE 6-5 on page 6-87 for more details.


6.8.1 Trap Stack Privileged Registers 0 through 3


The four trap stack registers (TPC, TNPC, TSTATE, and TT) form a group of registers that
are shadowed for each of the five trap levels. Each instance of the registers saves the state of
key integer unit parameters at each trap level. FIGURE 6-18 shows the format for this register
group, followed by a description of each register. FIGURE 6-19 shows how the
register stack responds to an example event.


The group of trap stack registers contains state information from the previous trap level. The
registers include values from the program counter (PC), the next program counter (nPC), the
trap state (TSTATE) register (a group of fields comprising the contents of the CCR, ASI,
CWP, and PSTATE registers), and the trap type (TT) register containing the value of the trap
that caused entry into the current trap level.


6.8.1.1 Common Attributes


There are MAXTL = 5 instances of the trap control registers, but only one group is accessible
at any time. The current value in the TL register determines which instance of the trap
control registers is accessible.
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All trap control registers are accessible in privileged mode. An attempt to read or write any 
of these registers in non-privileged mode causes a privileged_opcode exception. 


An attempt to read or write any of these registers when TL = 0 causes an illegal_instruction
exception.
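

The access rules above can be summarized in a short C sketch. The types and function names are hypothetical. Checking privilege before checking TL is an assumption of this sketch; the text does not state which exception takes priority when both conditions hold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAXTL 5

/* One instance of the trap stack registers (hypothetical model type). */
typedef struct {
    uint64_t tpc, tnpc, tstate;
    unsigned tt;
} trap_regs;

typedef enum { ACCESS_OK, EXC_PRIVILEGED_OPCODE, EXC_ILLEGAL_INSTRUCTION } access_outcome;

/* Outcome of an RDPR/WRPR access to TPC, TNPC, TSTATE, or TT. */
static access_outcome trap_reg_access(bool privileged, unsigned tl) {
    assert(tl <= MAXTL);
    if (!privileged) return EXC_PRIVILEGED_OPCODE;   /* assumed to be checked first */
    if (tl == 0)     return EXC_ILLEGAL_INSTRUCTION; /* no instance is selected at TL = 0 */
    return ACCESS_OK;
}

/* Returns the instance selected by TL (1..MAXTL), or NULL when TL = 0. */
static trap_regs *current_trap_regs(trap_regs stack[MAXTL], unsigned tl) {
    assert(tl <= MAXTL);
    return (tl == 0) ? NULL : &stack[tl - 1];
}
```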


Register   Field                                  Bits

TPC        PC from trap while in trap level       63:2 (bits 1:0 are 00)
TNPC       nPC from trap while in trap level      63:2 (bits 1:0 are 00)
TSTATE     CCR                                    39:32
           ASI                                    31:24
           reserved                               23:20
           PSTATE                                 19:8
           reserved                               7:3
           CWP                                    2:0
TT         Trap Type                              8:0


FIGURE 6-18 Trap State Register Format 


Trap Program Counter 


The Trap Program Counter (TPC) contains the PC from the previous trap level. 


Trap Next Program Counter 


The Trap Next Program Counter (TNPC) register is the nPC from the previous trap level. 


Trap State Register 


The Trap State (TSTATE) Register contains the state from the previous trap level, comprising
the contents of the CCR, ASI, CWP, and PSTATE registers from the previous trap level.








Trap Type 


The Trap Type (TT) register normally contains the trap type of the trap that caused entry to
the current trap level.


6.8.1.2 Trap Stack Operation


The trap stack and an event example are illustrated in FIGURE 6-19.


Event Example


1) Processor is at TL = 1
2) Processor traps
3) Current PC, nPC, etc. written into TL = 1 group
4) TL incremented to 2
5) Processor returns from trap
6) TL = 1 group is written to PC, nPC, etc.


FIGURE 6-19 Trap Stack and Event Example


6.8.1.3 Effects of Reset and Normal Operation













































































The effects of reset on each register are shown in TABLE 6-15. During normal operation, the 
trap stack register values defined for the trap levels above the current one are undefined. 


TABLE 6-15 Trap Stack Register Power-on and Normal Operation 





























Trap Control                                              During Normal Operation,
Register       After Power-On Reset                       for n greater than the
                                                          current trap level (n > TL)

TPC            TPC[1] to TPC[5] are undefined             TPC[n] is undefined
TNPC           TNPC[1] to TNPC[5] are undefined           TNPC[n] is undefined
TSTATE         TSTATE[1] to TSTATE[5] are undefined       TSTATE[n] is undefined
TT             TT[1] to TT[4] are undefined               TT[n] is undefined
               TT[5] = 001₁₆ (reset trap type)
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6.8.2 


Trap Base Address (TBA) Privileged Register 5 


The TBA register, shown in FIGURE 6-20, provides the upper 49 bits of the address used to 
select the trap vector for a trap. The TBA register is accessible using read and write 
privileged register instructions. The lower 15 bits of the TBA always read as zero, and writes 
to them are ignored. 


Field                  Bits

Trap Base Address      63:15
000 0000 0000 0000     14:0


FIGURE 6-20 Trap Base Address Register 





The full address for a trap vector is specified by the contents of the TBA, TL, and TT[TL]
registers at the time the trap is taken, as shown in FIGURE 6-21.
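

The composition of the trap vector address can be sketched in C using the bit positions shown in FIGURE 6-21 (TBA in bits 63:15, the "TL > 0" bit in bit 14, the trap type in bits 13:5, and zeros in bits 4:0). The function name and parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Builds the trap vector address from TBA, the TL value at the time the trap
 * was taken, and the 9-bit trap type. */
static uint64_t trap_vector_address(uint64_t tba, unsigned tl_at_trap, unsigned tt) {
    assert(tt <= 0x1FF);                             /* TT is a 9-bit field */
    uint64_t base   = tba & ~UINT64_C(0x7FFF);       /* TBA<63:15>; low 15 bits read as zero */
    uint64_t tl_gt0 = (tl_at_trap > 0) ? 1 : 0;      /* selects one of the two trap tables */
    return base | (tl_gt0 << 14) | ((uint64_t)tt << 5);
}
```

Because bits 4:0 are zero, consecutive trap types are 32 bytes apart, which leaves room for eight 4-byte instructions per vector.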


Field          Bits

TBA<63:15>     63:15
TL > 0         14
TT[TL]         13:5
00000          4:0


FIGURE 6-21 Trap Vector Address Format


TL > 0 bit


The “TL > 0” bit is zero if TL = 0 when the trap was taken, and one if TL > 0 when the trap
was taken. This implies that there are two trap tables: one for traps from TL = 0 and one for
traps from TL > 0.


TT[TL] field


The TT[TL] field is written with the contents of the TT register representing the new trap level
that is being taken.


6.8.3 Processor State (PSTATE) Privileged Register 6


The PSTATE register, shown in FIGURE 6-22, holds the current state of the processor. There
is only one instance of the PSTATE register. The PSTATE register is copied to a 12-bit field
in the TSTATE register of the trap stack.


FIGURE 6-22 PSTATE Fields


Writing PSTATE is nondelayed; that is, new machine state written to PSTATE is visible to
the next instruction executed. The privileged RDPR and WRPR instructions are used to read
and write all the bits in the PSTATE register, respectively.


Subsections on page 108 through page 110 describe the fields contained in the PSTATE
register.


6.8.3.1 Global Register Set Selection - IG, MG, AG bits


The UltraSPARC IIIi processor provides Interrupt and MMU Global Register sets in addition
to the two global register sets (normal and alternate) specified by SPARC-V9. The currently
active set of global registers is specified by the AG, IG, and MG bits, which are set and cleared
according to the events listed in TABLE 6-16.


Note — The IG, MG, and AG fields are saved on the trap stack along with the rest of the 
PSTATE Register. 





TABLE 6-16 PSTATE Global Register Selection Events


                                                                          PSTATE settings
Event                                     Globals selected for use        AG   IG   MG

DONE, RETRY [1]                           Global registers encoded in     0    0    0
                                          TSTATE register (previous
                                          global registers before most
                                          recent trap)

fast_instruction_access_MMU_miss,         MMU Global registers            0    0    1
fast_data_access_MMU_miss,
fast_data_access_protection,
data_access_exception,
instruction_access_exception

interrupt_vector trap                     Interrupt Global registers      0    1    0

Reserved [2]                                                              0    1    1

Write to privileged register (WRPR)       Any Global Register             x    x    x
that modifies AG, IG, or MG bits in
PSTATE register

Any trap other than those listed above    Alternate Global registers      1    0    0

Reserved                                  Reserved                        1    0    1

Reserved                                  Reserved                        1    1    0

Reserved                                  Reserved                        1    1    1


1. Since PSTATE is preserved in the TSTATE register when a trap occurs, the previous values of these bits are normally
restored upon return from a trap (via a DONE or RETRY instruction).

2. A WRPR to PSTATE, using a reserved combination of AG, IG, and MG bit values, causes an illegal_instruction
exception.





Executing a DONE or RETRY instruction restores the {AG, IG, MG} state that was in effect
before the trap was taken. Programmers can also set or clear these three bits by writing to the
PSTATE register with a WRPR instruction.














Note — Attempting to use the “wrpr %pstate” instruction to set a reserved encoding for 
IG, MG, and AG (more than one of these bits set) results in an illegal_instruction exception.
However, the processor does not check for a reserved encoding when writing directly to the
TSTATE register. Hence, executing a DONE or RETRY with an invalid AG, IG, MG bit
combination may result in undefined behavior of the processor.
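

Because the processor does not validate the encoding held in TSTATE, privileged software that builds a TSTATE value by hand may want to check it first. The following C sketch decodes the three bits into a register set and flags reserved combinations. The enum and function names are hypothetical; mapping the all-zero encoding to the normal globals follows from the description of the normal and alternate sets above.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    GLOBALS_NORMAL,
    GLOBALS_MMU,
    GLOBALS_INTERRUPT,
    GLOBALS_ALTERNATE,
    GLOBALS_RESERVED
} global_set;

/* Decodes the AG, IG, and MG bits into the selected global register set. */
static global_set select_globals(bool ag, bool ig, bool mg) {
    unsigned bits = ((unsigned)ag << 2) | ((unsigned)ig << 1) | (unsigned)mg;
    switch (bits) {
    case 0:  return GLOBALS_NORMAL;
    case 1:  return GLOBALS_MMU;
    case 2:  return GLOBALS_INTERRUPT;
    case 4:  return GLOBALS_ALTERNATE;
    default: return GLOBALS_RESERVED; /* more than one bit set */
    }
}

/* True when the combination is safe to place in PSTATE or TSTATE. */
static bool globals_encoding_valid(bool ag, bool ig, bool mg) {
    return select_globals(ag, ig, mg) != GLOBALS_RESERVED;
}
```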




















Compatibility Note — The UltraSPARC IIIi processor supports two more sets (privileged
only) of eight 64-bit global registers compared to the UltraSPARC II family: interrupt
globals and MMU globals. These additional registers are called the trap globals. Two 1-bit
fields, PSTATE.IG and PSTATE.MG, were added to the PSTATE register to select which
set of global registers to use.














PSTATE_interrupt_globals (IG)


When PSTATE.IG = 1, the processor interprets integer register numbers in the range 0 to 7 as
referring to the interrupt global register set. See the Note on page 109. When an
interrupt_vector trap (trap type = 60₁₆) is taken, the processor sets IG and clears AG and MG.


PSTATE_MMU_globals (MG)


When PSTATE.MG = 1, the processor interprets integer register numbers in the range 0 to 7
as referring to the MMU global register set.
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The processor sets PSTATE .MG and clears PSTATE.IG and PSTATE.AG when any of the 
following traps are taken: 


fast_instruction_access_MMU_miss trap (trap type = 64₁₆ to 67₁₆)
fast_data_access_MMU_miss trap (trap type = 68₁₆ to 6B₁₆)
fast_data_access_protection trap (trap type = 6C₁₆ to 6F₁₆)
data_access_exception trap (trap type = 30₁₆)
instruction_access_exception trap (trap type = 08₁₆)


PSTATE_alternate_globals (AG)


When PSTATE.AG = 1, the processor interprets integer register numbers in the range 0 to 7
as referring to the alternate global register set.


If an exception is taken and it does not set another global bit, then the processor defaults to 
the Alternate Global register set by setting PSTATE.AG and clearing PSTATE.IG and 
PSTATE.MG. 
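

The selection performed on trap entry can be sketched as a C function of the trap type, using the trap type values listed in this section. The structure and function name are hypothetical, and the sketch covers only the choice of global set, not the rest of trap entry.

```c
#include <assert.h>
#include <stdbool.h>

/* The three global-selection bits of PSTATE (hypothetical model type). */
typedef struct {
    bool ag, ig, mg;
} pstate_globals;

/* Returns the AG/IG/MG settings the processor applies when a trap of type tt is taken. */
static pstate_globals globals_on_trap(unsigned tt) {
    pstate_globals g = { false, false, false };
    assert(tt <= 0x1FF);
    if (tt == 0x060) {
        g.ig = true;                       /* interrupt_vector */
    } else if ((tt >= 0x064 && tt <= 0x06F) /* fast MMU miss and protection traps */
               || tt == 0x030               /* data_access_exception */
               || tt == 0x008) {            /* instruction_access_exception */
        g.mg = true;
    } else {
        g.ag = true;                       /* every other trap defaults to alternate globals */
    }
    return g;
}
```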











6.8.3.2 PSTATE_current_little_endian (CLE)


When PSTATE.CLE = 1, all data reads and writes using an implicit ASI are performed in
little-endian byte order with an ASI of ASI_PRIMARY_LITTLE. When PSTATE.CLE = 0,
all data reads and writes using an implicit ASI are performed in big-endian byte order with
an ASI of ASI_PRIMARY. Instruction accesses are always big-endian.

















6.8.3.3 PSTATE_trap_little_endian (TLE)


When a trap is taken, the current PSTATE register is pushed onto the trap stack and the 
PSTATE.TLE bit is copied into PSTATE.CLE in the new PSTATE register. This behavior 
allows system software to have a different implicit byte ordering than the current process. 
Thus, if PSTATE.TLE is set to one, data accesses using an implicit ASI in the trap handler 
are little-endian. The original state of PSTATE.CLE is restored when the original PSTATE 
register is restored from the trap stack. 
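

The interaction between TLE and CLE across a trap can be sketched in C. Only the two endianness bits are modeled; the type and function names are hypothetical, and the saved copy stands in for the PSTATE field held in TSTATE on the trap stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The two endianness bits of PSTATE (hypothetical model type). */
typedef struct {
    bool cle; /* current little endian */
    bool tle; /* trap little endian */
} endian_bits;

/* Trap entry: the old bits are saved and TLE is copied into CLE. */
static endian_bits enter_trap(endian_bits *pstate) {
    assert(pstate != NULL);
    endian_bits saved = *pstate;
    pstate->cle = pstate->tle;
    return saved;
}

/* Trap return: the saved bits are restored, which brings back the original CLE. */
static void return_from_trap(endian_bits *pstate, endian_bits saved) {
    assert(pstate != NULL);
    *pstate = saved;
}
```

With CLE = 0 and TLE = 1, for example, a big-endian process runs with a little-endian trap handler, and the process sees big-endian accesses again after the return.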
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6.8.3.4 


6.8.3.5 


6.8.3.6 


PSTATE_mem_model (MM) 


The processor supports Total Store Order (TSO) only. The 2-bit PSTATE.MM field is
hardwired to 00, indicating TSO mode. See TABLE 6-17 for the MM encodings.





TABLE 6-17 MM Encodings 


MM Value    SPARC-V9

00          Total Store Order (TSO)
01          Reserved
10          Reserved
11          Reserved








Total Store Order (TSO) — Loads are ordered with respect to earlier loads. Stores are
ordered with respect to earlier loads and stores. Thus, loads can bypass earlier stores but
cannot bypass earlier loads; stores cannot bypass earlier loads and stores. Programs that
execute correctly in either PSO or RMO will execute correctly in the TSO model.


PSTATE_RED_state (RED)


PSTATE.RED (Reset, Error, and Debug state) is set whenever the UltraSPARC IIIi
processor takes a RED_state disrupting or nondisrupting trap. The IU sets PSTATE.RED
when any hardware reset occurs. It also sets PSTATE.RED when a trap is taken while
TL = (MAXTL - 1). Software can exit RED_state by executing a DONE or RETRY
instruction, which restores the stacked copy of PSTATE and clears PSTATE.RED if it was
zero in the stacked copy.
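

The conditions that set PSTATE.RED, and the effect of DONE or RETRY on it, can be sketched in C. The function and parameter names are hypothetical, and the sketch models only the RED bit, not the other side effects of entering RED_state.

```c
#include <assert.h>
#include <stdbool.h>

#define MAXTL 5

/* True when the event described by the arguments sets PSTATE.RED. */
static bool sets_pstate_red(unsigned tl_before_trap, bool hardware_reset, bool red_state_trap) {
    assert(tl_before_trap <= MAXTL);
    return hardware_reset || red_state_trap || tl_before_trap == MAXTL - 1;
}

/* DONE and RETRY restore the stacked PSTATE, so RED afterward equals the stacked RED bit. */
static bool red_after_done_retry(bool stacked_red) {
    return stacked_red;
}
```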










































































Note — Software can also exit RED_state by writing a zero to PSTATE.RED with a
WRPR instruction. However, this method is not recommended due to potential side effects
and unpredictable behavior.














PSTATE_enable_floating-point (PEF)


When set to one, the PSTATE.PEF bit enables the FPU, which allows privileged software to
manage the FPU. For the FPU to be usable, both PSTATE.PEF and FPRS.FEF must be set.
Otherwise, any floating-point instruction that tries to reference the FPU causes an fp_disabled
trap.
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6.8.3.7 


6.8.3.8 


6.8.3.9 


6.8.4 
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PSTATE_address_mask (AM) 


When PSTATE.AM = 1, the high-order 32 bits of any virtual address for instruction and
data accesses are cleared to zero in the following cases:


Before data addresses are sent out of the processor
Before addresses are sent to the MMU
For instruction accesses to all caches
Before being stored to a general-purpose register for CALL, JMPL, and RDPC instructions
Before being stored to TPC[n] and TNPC[n] when a trap occurs


When an ASI_PHYS_* ASI is used in a load or store instruction, the setting of
PSTATE.AM is ignored and the full 64-bit address is used. (See ASI 14₁₆,
ASI_PHYS_USE_EC, for an example.)




















When PSTATE.AM = 1, the processor writes the full 64-bit program counter value (upper 32
bits are forced to be zero) to the destination register of a CALL, JMPL, or RDPC instruction.


When PSTATE.AM = 1 and a trap occurs, the processor writes the full 64-bit program
counter value to TPC[TL].


When PSTATE.AM = 1 and a synchronous exception occurs, the processor writes the full
64-bit address to the Data Synchronous Fault Address Register.


When PSTATE.AM = 1 and an asynchronous exception occurs, the processor writes the full
64-bit address to the Data Asynchronous Fault Address Register.


The PSTATE.AM bit must be set when 32-bit software is executed. 
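

The masking rule can be sketched as a single C function. The function and parameter names are hypothetical; asi_is_phys stands for any access that uses one of the ASI_PHYS_* address space identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns the address actually used for an access under the given PSTATE.AM setting. */
static uint64_t masked_address(uint64_t va, bool pstate_am, bool asi_is_phys) {
    if (pstate_am && !asi_is_phys) {
        return va & UINT64_C(0xFFFFFFFF); /* high-order 32 bits cleared to zero */
    }
    return va; /* AM = 0, or AM ignored for ASI_PHYS_* accesses */
}
```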





PSTATE_privileged_mode (PRIV)





When PSTATE.PRIV = 1, the processor is in privileged mode. This bit is controlled by 
events in the processor and can be explicitly set. 


PSTATE_interrupt_enable (IE)














When PSTATE.IE = 1, the processor can accept interrupts. 


Trap Level (TL) Privileged Register 7 


The trap level register, shown in FIGURE 6-23, specifies the current trap level. TL = 0 is the
normal (nontrap) level of operation. TL > 0 implies that one or more traps are being
processed. The maximum valid value that the TL register may contain is MAXTL = 5, which
is always equal to the number of supported trap levels beyond Level-0.


Field   Bits

TL      2:0


FIGURE 6-23 Trap Level Register


Programming Note — Writing to the TL register with a value greater than MAXTL (five
for the UltraSPARC IIIi processor) causes the value MAXTL to be written.


Writing the TL register with a wrpr %tl instruction does not alter any other processor state;
that is, it is not equivalent to taking or returning from a trap.


6.8.5 Processor Interrupt Level (PIL) Privileged Register 8


The processor interrupt level (PIL), illustrated in FIGURE 6-24, is the interrupt level above
which the processor will accept an interrupt. Interrupt priorities are mapped so that interrupt
Level-2 has greater priority than interrupt Level-1, and so on.


Field   Bits

PIL     3:0


FIGURE 6-24 Processor Interrupt Level Register


Compatibility Note — On SPARC-V8 processors, the Level-15 interrupt is considered to
be nonmaskable, so it has different semantics from other interrupt levels. SPARC-V9
processors do not treat Level-15 interrupts differently from other interrupt levels.


6.8.6 Register-Window State Privileged Registers 9 through 13


The state of the register window is determined by a set of privileged registers that are read 
and written by privileged mode software using the RDPR and WRPR instructions, respectively. 
In addition, these privileged registers are modified by instructions related to register 
windowing and are used to generate traps that allow supervisor software to spill, fill, and 
clean the register window sets. TABLE 6-18 describes the register-window state privileged 
registers. 




Register-window management is described in a separate chapter. 


TABLE 6-18 Register-Window State Privileged Registers 




















Register-window State Registers      Value Range   Description

Current Window Pointer (CWP)         0 to 7        State Register 9: The CWP register is a counter that identifies
                                                   the current window into the set of integer registers.

Savable Window Sets (CANSAVE)        0 to 6        State Register 10: The CANSAVE register contains the
                                                   number of register sets following CWP that are not in use and
                                                   are available to be allocated by a SAVE instruction without
                                                   generating a window spill exception.

Restorable Window Sets               0 to 7        State Register 11: The CANRESTORE register contains the
(CANRESTORE)                                       number of register sets preceding CWP that are in use by the
                                                   current program and can be restored (by the RESTORE
                                                   instruction) without generating a window fill exception.

Clean Window Sets (CLEANWIN)         0 to 6        State Register 12: The CLEANWIN register contains the
                                                   number of windows that can be used by the SAVE instruction
                                                   without causing a clean_window exception.

Other Window Sets (OTHERWIN)         0 to 7        State Register 13: The OTHERWIN register contains the
                                                   count of register sets that will be spilled/filled by a separate
                                                   set of trap vectors based on the contents of WSTATE_OTHER.
                                                   If OTHERWIN is zero, register sets are spilled/filled by use of
                                                   trap vectors based on the contents of WSTATE_NORMAL.
                                                   The OTHERWIN register can be used to split the register sets
                                                   among different address spaces and handle spill/fill traps
                                                   efficiently by use of separate spill/fill vectors.








Note — The CWP, CANSAVE, CANRESTORE, OTHERWIN, and CLEANWIN registers contain
values in the range 0 to 7 or 0 to 6 as indicated in TABLE 6-18. The effect of writing a value
greater than indicated to any of these registers is undefined. The values programmed into
these registers must together form a consistent register-window state.
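

One way for supervisor software to guard against an inconsistent window state is to check it before writing the registers. The C sketch below is an illustration under stated assumptions: the structure and function name are hypothetical, and the sum rule (CANSAVE + CANRESTORE + OTHERWIN = NWINDOWS - 2) is taken from the register-window state definition of the SPARC-V9 architecture rather than from this section. The upper bound on CLEANWIN follows the limit of six given for that register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NWINDOWS 8

/* Candidate values for the register-window state registers (hypothetical model type). */
typedef struct {
    unsigned cwp, cansave, canrestore, otherwin, cleanwin;
} window_state;

/* Returns true when the values form a consistent register-window state. */
static bool window_state_consistent(const window_state *w) {
    assert(w != NULL);
    if (w->cwp > NWINDOWS - 1)      return false; /* CWP ranges over 0 to 7 */
    if (w->cleanwin > NWINDOWS - 2) return false; /* CLEANWIN must not exceed six */
    /* Assumed SPARC-V9 rule: the three counts account for all but two windows. */
    return w->cansave + w->canrestore + w->otherwin == NWINDOWS - 2;
}
```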





Note — The most significant 61 bits of all these registers are set to zero. When any are
written, the most significant 61 bits are ignored.


Compatibility Note — The following differences between SPARC-V8 and SPARC-V9 are
visible only to privileged software; they are invisible to non-privileged software.


1. In SPARC-V9, SAVE increments CWP and RESTORE decrements CWP. In SPARC-V8, the
opposite is true: SAVE decrements PSR.CWP and RESTORE increments PSR.CWP.


2. PSR.CWP in SPARC-V8 is changed on each trap. In SPARC-V9, CWP is affected only by
a trap caused by a window fill or spill exception.


Clean Windows (CLEANWIN) Register Note


The CLEANWIN register counts the number of register window sets that are “clean” with
respect to the current program, that is, register sets that contain only zeroes, valid addresses,
or valid data from that program. Registers in these windows need not be cleaned before they
can be used. The count includes the register sets that can be restored (the value in the
CANRESTORE register) and the register sets following CWP that can be used without
cleaning. When a clean window is requested (by a SAVE instruction) and none is available, a
clean_window exception occurs to cause the next window to be cleaned.


Programming Note — CLEANWIN must never be set to a value greater than six. Setting
CLEANWIN greater than six would violate the register window state definition. Notice that
the hardware does not enforce this restriction; it is up to supervisor software to keep the
window state consistent.


6.8.7 Window State (WSTATE) Privileged Register 14


The WSTATE register, shown in FIGURE 6-25, specifies bits that are inserted into TT[TL]<4:2>
on traps caused by window spill and fill exceptions.


This register is read and written by using the RDPR and WRPR privileged instructions.


These bits are used to select one of eight different window spill and fill handlers. If
OTHERWIN = 0 at the time a trap is taken because of a window spill or window fill
exception, then the WSTATE.NORMAL bits are inserted into the TT[TL] field of the Trap Vector
Address. Otherwise, the WSTATE.OTHER bits are inserted into TT[TL].
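

The handler selection can be sketched in C using the field positions shown in FIGURE 6-25 (NORMAL in bits 2:0 and OTHER in bits 5:3). Both function names are hypothetical, and tt_base is a caller-supplied trap type whose bits 4:2 are to be filled in; the sketch does not model any other difference between the trap types for the two cases.

```c
#include <assert.h>

/* Selects the 3-bit handler index: WSTATE.NORMAL when OTHERWIN = 0, else WSTATE.OTHER. */
static unsigned wstate_handler_bits(unsigned wstate, unsigned otherwin) {
    assert(wstate <= 0x3F);                 /* WSTATE is a 6-bit register */
    unsigned normal = wstate & 0x7;         /* bits 2:0 */
    unsigned other  = (wstate >> 3) & 0x7;  /* bits 5:3 */
    return (otherwin == 0) ? normal : other;
}

/* Inserts the 3-bit handler index into bits 4:2 of a trap type value. */
static unsigned insert_into_tt(unsigned tt_base, unsigned handler_bits) {
    assert(handler_bits <= 0x7);
    return (tt_base & ~0x1Cu) | (handler_bits << 2);
}
```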
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WSTATE OTHER NORMAL 


5 3 2 0 


FIGURE 6-25 WSTATE Register 





6.8.8 Version (VER) Privileged Register 31 


The version register, shown in FIGURE 6-26, specifies the fixed parameters pertaining to a 
particular processor implementation and mask set. 





The VER register is read-only, readable by the RDPR privileged instruction. 


Field       Bits     Value

manuf       63:48    003E₁₆
impl        47:32
mask        31:24
reserved    23:16    0000 0000
maxtl       15:8     5
reserved    7:5      000
maxwin      4:0      7


FIGURE 6-26 Version Register 


VER.manuf field 





The VER.manuf field contains Sun’s 16-bit JEDEC semiconductor manufacturer code, 003E₁₆.


VER.impl field 





The VER.impl field uniquely identifies the processor implementation or class of
software-compatible implementations of the architecture. TABLE 6-19 shows the processor
implementation codes.


TABLE 6-19 Processor Implementation Codes 


Processor          VER.impl

UltraSPARC I       0010₁₆
UltraSPARC II
UltraSPARC IIi
UltraSPARC IIe
UltraSPARC IIIi
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VER.mask field 


The VER.mask specifies the current mask set revision and is chosen by the implementor. It 
generally increases numerically with successive releases of the processor but does not 
necessarily increase by one for consecutive releases. TABLE 6-20 shows the UltraSPARC IIIi 
Processor Mask Version. 





TABLE 6-20 UltraSPARC IIIi Processor Mask Version Codes 





Mask Version VER.mask 





TO l.x 4 hl 





TO 2.x 





VER.maxtl field 


The VER.maxtl value, 5, is the maximum number of trap levels supported by the
processor.





VER.maxwin field 


The VER.maxwin value, 7, is the maximum index (NWINDOWS - 1) of the Integer Unit register
windows that access the NWINDOWS = 8 window register sets.
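

The field boundaries shown in FIGURE 6-26 can be turned into a small C decoder. The ver_fields structure and the decode_ver function are hypothetical names for this sketch; obtaining the raw 64-bit value requires an RDPR of the VER register in privileged mode and is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of the VER register (hypothetical model type). */
typedef struct {
    unsigned manuf;  /* bits 63:48 */
    unsigned impl;   /* bits 47:32 */
    unsigned mask;   /* bits 31:24 */
    unsigned maxtl;  /* bits 15:8 */
    unsigned maxwin; /* bits 4:0 */
} ver_fields;

/* Splits a raw VER value into its fields. Reserved bits are ignored. */
static ver_fields decode_ver(uint64_t ver) {
    ver_fields f;
    f.manuf  = (unsigned)((ver >> 48) & 0xFFFF);
    f.impl   = (unsigned)((ver >> 32) & 0xFFFF);
    f.mask   = (unsigned)((ver >> 24) & 0xFF);
    f.maxtl  = (unsigned)((ver >> 8) & 0xFF);
    f.maxwin = (unsigned)(ver & 0x1F);
    return f;
}
```

Software can derive the number of register windows as maxwin + 1 instead of hard-coding it.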








6.9 Special Access Register


6.9.1 Floating-Point Status Register (FSR)


The FSR register fields, illustrated in FIGURE 6-27, contain FPU mode and status information.
State information about the FPU is presented in Section 6.6.6 "Floating-Point
Registers State (FPRS) Register 6" on page 6-93.


The FSR is accessible using special load and store opcodes, which work in both privileged and
non-privileged mode. The lower 32 bits of the FSR are read and written by the STFSR and
LDFSR floating-point instructions, respectively; all 64 bits of the FSR are read and written by the
STXFSR and LDXFSR floating-point instructions, respectively.


The ver, ftt, and reserved (“—”) fields are read-only; they are not modified by LDFSR or
LDXFSR.
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— fcc3 | fcc2 | fcc 


Field       Bits

reserved    63:38
fcc3        37:36
fcc2        35:34
fcc1        33:32
RD          31:30
reserved    29:28
TEM         27:23
NS          22
reserved    21:20
ver         19:17
ftt         16:14
0           13
reserved    12
fcc0        11:10
aexc        9:5
cexc        4:0


FIGURE 6-27 FSR Fields 


Reserved Bits 


Bits 63:38, 29:28, 21:20, and 12 are reserved. When read by an STXFSR instruction, these
bits read as zero. Software should issue LDXFSR instructions only with zero values in
these bits, unless the values of these bits are exactly those derived from a previous STXFSR.


The subsections on pages 118 through 126 describe the remaining fields in the
FSR.


6.9.1.1 FSR_fp_condition_codes (fcc0, fcc1, fcc2, fcc3)


The four sets of floating-point condition code fields are labeled fcc0, fcc1, fcc2, and
fcc3.


Compatibility Note — fcc0 defined in SPARC-V9 is the same as fcc defined in 
SPARC-V8. 





The fcc0 field consists of bits 11 and 10 of the FSR, fcc1 consists of bits 33 and 32,
fcc2 consists of bits 35 and 34, and fcc3 consists of bits 37 and 36. Execution of a
floating-point compare instruction (FCMP or FCMPE) updates one of the fccn fields in the
FSR, as selected by the instruction. The fccn fields can be read and written by STXFSR and
LDXFSR instructions, respectively. The fcc0 field can also be read and written by STFSR
and LDFSR, respectively. FBfcc and FBPfcc instructions base their control transfers on
these fields. The MOVcc and FMOVcc instructions can conditionally copy a register, based on
the state of these fields.
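

Because fcc0 sits in the lower word while fcc1 through fcc3 sit above bit 31, extracting a condition code from a stored FSR image needs a per-field shift. The C sketch below uses the bit positions given above; the function name is hypothetical, and the input is assumed to be a 64-bit value previously stored with STXFSR.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the 2-bit fccn field (n = 0..3) from a 64-bit FSR image. */
static unsigned fsr_fcc(uint64_t fsr, unsigned n) {
    /* Lowest bit of each field: fcc0 at 10, fcc1 at 32, fcc2 at 34, fcc3 at 36. */
    static const unsigned shift[4] = { 10, 32, 34, 36 };
    assert(n < 4);
    return (unsigned)((fsr >> shift[n]) & 0x3);
}
```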
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6.9.1.2 


6.9.1.3 


In TABLE 6-21, f_rs1 and f_rs2 correspond to the single, double, or quad values in the
floating-point registers specified by a floating-point compare instruction's rs1 and rs2 fields. The
question mark (?) indicates an unordered relation, which is true if either f_rs1 or f_rs2 is a
signalling NaN or a quiet NaN. If FCMP or FCMPE generates an fp_exception_ieee_754
exception, then fccn is unchanged. TABLE 6-21 shows the floating-point condition codes.





TABLE 6-21 Floating-Point Condition Codes (fccn) Fields of FSR 

















Content of fccn    Indicated Relation

0                  f_rs1 = f_rs2
1                  f_rs1 < f_rs2
2                  f_rs1 > f_rs2
3                  f_rs1 ? f_rs2 (unordered)











FSR_rounding_direction (RD) 


Bits 31 and 30 select the rounding direction for floating-point results according to 
IEEE Std 754-1985. TABLE 6-22 shows the rounding direction fields. 


TABLE 6-22 Rounding Direction (RD) Field of FSR 








RD    Round Toward

0     Nearest (even, if tie)
1     0
2     +∞
3     -∞











If GSR.IM = 1, then the value of FSR.RD is ignored and floating-point results are instead
rounded according to GSR.IRND.


FSR_nonstandard_fp (NS) 


The NS bit allows the processor to flush a subnormal floating-point value to zero. If a
floating-point add/subtract operation results in a subnormal value and FSR.NS = 1, the value
is replaced by a floating-point zero value of the same sign. This replacement is usually
performed in hardware. However, for the following cases when a subnormal value is
generated in the course of the instruction and FSR.NS = 1, an fp_exception_other exception
with FSR.ftt = 2 (unfinished_FPop) is taken and trap handler software is expected to
replace the subnormal value with a zero value of the appropriate sign:


fadd of numbers with opposite signs 
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6.9.1.4 


6.9.1.5 




fsub of numbers with the same signs 
fdtos 


The effects of FSR.NS = 1 are as follows: 


If a floating-point source operand is subnormal, it is replaced by a floating-point zero 
value of the same sign (instead of causing an exception). 


If a floating-point operation generates a subnormal value, the value is replaced with a 
floating-point zero value of the same sign. 


This is implemented by performing the replacement in hardware, which sometimes causes an
fp_exception_other exception with FSR.ftt = 2 (unfinished_FPop) so that trap handler
software can perform the replacement.


If GSR.IM = 1, then the value of FSR.NS is ignored and the processor operates as if
FSR.NS = 0.
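The replacement that a trap handler is expected to perform can be sketched as follows for single-precision values. This is an illustrative model operating on raw 32-bit IEEE 754 bit patterns, not the handler code of any particular operating system; the function names are hypothetical, and FSR.NS and GSR.IM are passed in as already-extracted bits.

```python
SIGN_MASK = 0x80000000      # sign bit of a single-precision value
EXPONENT_MASK = 0x7F800000  # 8-bit biased exponent
FRACTION_MASK = 0x007FFFFF  # 23-bit fraction


def flush_subnormal_single(bits: int) -> int:
    """Replace a subnormal single-precision pattern with a zero of the same sign."""
    is_subnormal = (bits & EXPONENT_MASK) == 0 and (bits & FRACTION_MASK) != 0
    return bits & SIGN_MASK if is_subnormal else bits


def apply_ns(bits: int, fsr_ns: int, gsr_im: int) -> int:
    """Apply nonstandard mode; FSR.NS is treated as 0 whenever GSR.IM = 1."""
    effective_ns = 0 if gsr_im else fsr_ns
    return flush_subnormal_single(bits) if effective_ns else bits
```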


FSR_version (ver)

Version number 7 is reserved to indicate that no hardware floating-point controller is present. 


The ver field is read-only; it cannot be modified by the LDFSR and LDXFSR instructions. 


FSR_floating-point_trap_type (ftt)


When a floating-point exception trap occurs, ftt (bits 16 through 14 of the FSR) identifies
the cause of the exception, the “floating-point trap type.” Several conditions can cause a
floating-point exception trap. After a floating-point exception occurs, the ftt field encodes
the type of the floating-point exception until a STFSR or FPop is executed.


The ftt field can be read by the STFSR and STXFSR instructions. The LDFSR and LDXFSR
instructions do not affect ftt because this field is read-only.


Privileged software that handles floating-point traps must execute a STFSR (or STXFSR) to
determine the floating-point trap type. STFSR and STXFSR clear the ftt field after the store
completes without error. If the store generates an error and does not complete, ftt remains
unchanged.
Programming Note: Neither LDFSR nor LDXFSR can be used for the purpose of
clearing ftt, since both leave ftt unchanged. However, executing a non-trapping FPop
such as “fmovs %f0,%f0” prior to returning to non-privileged mode will zero ftt. The
ftt remains valid until the next FPop instruction completes execution.


The ftt field encodes the floating-point trap type according to TABLE 6-23. Note: The value
“7” is reserved for future expansion.


TABLE 6-23 Floating-Point Trap Type (ftt) Field of FSR


ftt   Trap Type              Trap Vector

0     None                   No trap taken
1     IEEE_754_exception     fp_exception_ieee_754
2     unfinished_FPop        fp_exception_other
3     unimplemented_FPop     fp_exception_other
4     sequence_error         Reserved, Unimplemented
5     hardware_error         Reserved, Unimplemented
6     invalid_fp_register    Reserved, Unimplemented
7     Reserved               Reserved, Unimplemented








IEEE_754_exception, unfinished_FPop, and unimplemented_FPop will likely arise
occasionally in the normal course of computation and must be recoverable by system
software.


When a floating-point trap occurs, the following results are observed by user software:

1. The value of aexc is unchanged. See Section 6.9.1.6 for details of aexc.

2. The value of cexc is unchanged, except for an IEEE_754_exception, where a bit
corresponding to the trapping exception is set. The unfinished_FPop,
unimplemented_FPop, sequence_error, and invalid_fp_register floating-point trap types
do not affect cexc. See Section 6.9.1.6 for details of cexc.

3. The source and destination registers are unchanged.

4. The value of fccn is unchanged.


The foregoing describes the result seen by a user trap handler if an IEEE exception is
signalled, either immediately from an IEEE_754_exception or after recovery from an
unfinished_FPop or unimplemented_FPop. In either case, cexc as seen by the trap handler
reflects the exception causing the trap.


In the cases of fp_exception_other exceptions with unfinished_FPop or unimplemented_FPop
trap types that do not subsequently generate IEEE traps, the recovery software should define
cexc, aexc, and the destination registers or fccs, as appropriate.


ftt = IEEE_754_exception. The IEEE_754_exception floating-point trap type indicates
the occurrence of a floating-point exception conforming to IEEE Std 754-1985. The
exception type is encoded in the cexc field.


The aexc and fccs fields and the destination f register are not affected by an
IEEE_754_exception trap.


ftt = unfinished_FPop. The unfinished_FPop floating-point trap type indicates that the
processor was unable to generate correct results or that exceptions as defined by
IEEE Std 754-1985 have occurred. Where exceptions have occurred, the cexc field is
unchanged.


The conditions under which an fp_exception_other exception with floating-point trap type of
unfinished_FPop can occur are implementation dependent. The recommended set of
conditions is shown in TABLE 6-24. An implementation may cause fp_exception_other with
unfinished_FPop under a different (but specified) set of conditions.


TABLE 6-24 Standard Conditions Under Which unfinished_FPop Trap Type Can Occur


FPU         1 subnormal (SBN)     2 subnormal (SBN)     Result/Non-SBN Operand
Operation   operand               operands
            IM = 1 or NS = 0      IM = 1 or NS = 0      IM = 1 or NS = 0

fadds       Unfinished trap       Unfinished trap       fi, fv, fu, sbn (IM = NS = x),
                                                        NaN (either operand)
fsubs       Unfinished trap       Unfinished trap       fi, fv, fu, sbn (IM = NS = x),
                                                        NaN (either operand)
faddd       Unfinished trap       Unfinished trap       fi, fv, fu, sbn (IM = NS = x),
                                                        NaN (either operand)
fsubd       Unfinished trap       Unfinished trap       fi, fv, fu, sbn (IM = NS = x),
                                                        NaN (either operand)
fmuls       Unfinished trap if    Unfinished trap if    -25 < Er <= 1
            result not zero       result not zero
fdivs       Unfinished trap       Unfinished trap       -25 < Er <= 1
fsmuld      Unfinished trap       Unfinished trap       None
fmuld       Unfinished trap if    Unfinished trap if    -54 < Er <= 1
            result not zero       result not zero
fdivd       Unfinished trap       Unfinished trap       -54 < Er <= 1
fsqrts      Unfinished trap       N/A                   None
fsqrtd      Unfinished trap       N/A                   None
fstoi       Unfinished trap       N/A                   -2^31 <= res < 2^31, Infinity, NaN
fdtoi       Unfinished trap       N/A                   -2^31 <= res < 2^31, Infinity, NaN
fstox       Unfinished trap       N/A                   |result| >= -2°7, Infinity, NaN
fdtox       Unfinished trap       N/A                   |result| >= -2??, Infinity, NaN
fitos       N/A                   N/A                   - 22 < operand < 2?
fxtos       N/A                   N/A                   - 2°? < operand < 2?
fitod       N/A                   N/A                   None
fxtod       N/A                   N/A                   - 2?! < operand < 2?!
fstod       Unfinished trap       N/A                   NaN
fdtos       Unfinished trap       N/A                   fi, fv, fu, sbn (IM = NS = x), NaN




















Note:
Er: Biased exponent of the result before rounding
Ei: Biased exponent of the input operand
fi: Invalid (Infinity - Infinity, Infinity * 0, 0/0, Infinity/Infinity)
fv: Overflow, Er >= 2047 (DP) or 255 (SP), but not exact infinity
fu: Underflow, 0 < |result| < 2^-1022 (DP) or 2^-126 (SP)
subnormal (sbn): |number| = 2^-1022 x (significand x 2^-52) (DP) or 2^-126 x (significand x 2^-23) (SP)
{-54 < Er < 1 (DP) or -25 < Er < 1 (SP)}


ftt = unimplemented_FPop. The unimplemented_FPop floating-point trap type indicates
that the processor decoded an FPop that it does not implement. In this case, the cexc field
is unchanged.


All quad FPop variations set ftt = unimplemented_FPop.


6.9.1.6 Floating-Point Exceptions Control and Status


Three FSR fields control and report the status of the events associated with
floating-point exceptions.


FSR_trap_enable_mask (TEM) 


Bits 27 through 23 are enable bits for each of the five IEEE-754 floating-point exceptions
that can be indicated in the current_exception field (cexc). See FIGURE 6-28 for an
illustration. If a floating-point operate instruction generates one or more exceptions and the
TEM bit corresponding to any of the exceptions is one, then this condition causes an
fp_exception_ieee_754 trap. A TEM bit value of zero prevents the corresponding exception
type from generating a trap.


NVM | OFM | UFM | DZM | NXM
 27 |  26 |  25 |  24 |  23


FIGURE 6-28 Trap Enable Mask (TEM) Fields of FSR





FSR_accrued_exception (aexc) 


Bits 9 through 5 accumulate IEEE-754 floating-point exceptions as long as floating-point
exception traps are disabled through the TEM field. See FIGURE 6-29 for an illustration. After
an FPop completes with ftt = 0, the TEM and cexc fields are logically ANDed together. If
the result is nonzero, aexc is left unchanged and an fp_exception_ieee_754 trap is generated;
otherwise, the new cexc field is ORed into the aexc field and no trap is generated. Thus,
while (and only while) traps are masked, exceptions are accumulated in the aexc field.
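The AND/OR rule above can be expressed compactly. The following sketch treats TEM, aexc, and cexc as 5-bit integers whose bit positions correspond one to one; the function name is hypothetical.

```python
def complete_fpop(tem: int, aexc: int, new_cexc: int) -> tuple[int, bool]:
    """Model the aexc update after an FPop completes with ftt = 0.

    Returns (updated aexc, whether an fp_exception_ieee_754 trap is generated).
    """
    if tem & new_cexc:
        # At least one raised exception is enabled: trap, aexc is left unchanged.
        return aexc, True
    # All raised exceptions are masked: accumulate them into aexc.
    return aexc | new_cexc, False
```

With every TEM bit clear, repeated calls only ever add bits to aexc, which is the accumulation behavior the text describes.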








This field is also written with the appropriate value when an LDFSR or LDXFSR instruction 
is executed. 


FIGURE 6-29 Accrued Exception Bits (aexc) Fields of FSR 


FSR_current_exception (cexc) 


Bits 4 through 0 indicate that one or more IEEE-754 floating-point exceptions were 
generated by the most recently executed FPop instruction. The absence of an exception 
causes the corresponding bit to be cleared. See FIGURE 6-30 for an illustration. 


FIGURE 6-30 Current Exception Bits (cexc) Fields of FSR 


Note: If the FPop traps and software emulates or finishes the instruction, the system software
in the trap handler is responsible for creating a correct FSR.cexc value before returning to
a non-privileged program.


The cexc bits are set as described in Section 6.9.1.7, “Floating-Point Exception Fields,” by
the execution of an FPop that either does not cause a trap or causes an fp_exception_ieee_754
exception with FSR.ftt = IEEE_754_exception. An IEEE_754_exception that traps shall
cause exactly one bit in FSR.cexc to be set, corresponding to the detected IEEE Std 754
exception.


Floating-point operations which cause an overflow or underflow condition may also cause an 
“inexact” condition. For overflow and underflow conditions, FSR.cexc bits are set and 
trapping occurs as follows: 


An IEEE 754 overflow condition (of) occurs:

    If OFM = 0 and NXM = 0, the cexc.ofc and cexc.nxc bits are both set to one, the
    other three bits of cexc are set to zero, and an fp_exception_ieee_754 trap does not
    occur.

    If OFM = 0 and NXM = 1, the cexc.nxc bit is set to one, the other four bits of cexc
    are set to zero, and an fp_exception_ieee_754 trap does occur.

    If OFM = 1, the cexc.ofc bit is set to one, the other four bits of cexc are set to zero,
    and an fp_exception_ieee_754 trap does occur.


An IEEE 754 underflow condition (uf) occurs:

    If UFM = 0 and NXM = 0, the cexc.ufc and cexc.nxc bits are both set to one, the
    other three bits of cexc are set to zero, and an fp_exception_ieee_754 trap does not
    occur.

    If UFM = 0 and NXM = 1, the cexc.nxc bit is set to one, the other four bits of cexc
    are set to zero, and an fp_exception_ieee_754 trap does occur.

    If UFM = 1, the cexc.ufc bit is set to one, the other four bits of cexc are set to zero,
    and an fp_exception_ieee_754 trap does occur.
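The six cases above reduce to one small decision function per condition. The sketch below is illustrative only. It assumes that cexc uses the same bit order as the TEM field shown in FIGURE 6-28 (nv, of, uf, dz, nx from most to least significant), and the function names are hypothetical.

```python
# Bit positions within the 5-bit TEM and cexc fields (assumed to match FIGURE 6-28)
NX, DZ, UF, OF, NV = 0x01, 0x02, 0x04, 0x08, 0x10


def _range_exception(tem: int, flag: int) -> tuple[int, bool]:
    """Shared rule for overflow (flag = OF) and underflow (flag = UF)."""
    if tem & flag:
        return flag, True        # the condition's own trap is enabled
    if tem & NX:
        return NX, True          # only the inexact trap is enabled
    return flag | NX, False      # both masked: both bits set, no trap


def overflow_cexc(tem: int) -> tuple[int, bool]:
    """Return (cexc, trap occurs?) for an IEEE 754 overflow condition."""
    return _range_exception(tem, OF)


def underflow_cexc(tem: int) -> tuple[int, bool]:
    """Return (cexc, trap occurs?) for an IEEE 754 underflow condition."""
    return _range_exception(tem, UF)
```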


The behavior is summarized in TABLE 6-25 (where “x” indicates “don’t care”): 


TABLE 6-25 Setting of FSR.cexc bits


Exception(s) Detected    Trap Enable Mask bits    fp_exception_ieee_754    Current Exception bits
in f.p. operation        (in FSR.TEM)             Trap Occurs?             (in FSR.cexc)


Notes:

(1) When the underflow trap is disabled (UFM = 0), underflow is always accompanied by
inexact.
(2) Overflow is always accompanied by inexact.











If the execution of an FPop causes a trap other than fp_exception_ieee_754, FSR.cexc is
left unchanged.


Floating-Point Exception Fields 


The current and accrued exception fields and the trap enable mask assume the following 
definitions of the floating-point exception conditions (per IEEE Std 754-1985): 


FSR_invalid (nvc, nva)


An operand is improper for the operation to be performed. For example, 0.0 ÷ 0.0 and ∞ − ∞
are invalid; 1 = invalid operand(s), 0 = valid operand(s).


FSR_overflow (ofc, ofa)


The result, rounded as if the exponent range were unbounded, would be larger in magnitude
than the destination format's largest finite number; 1 = overflow, 0 = no overflow.


UltraSPARC IIIi Processor User's Manual * June 2003


FSR_underflow (ufc, ufa) 


The rounded result is inexact and would be smaller in magnitude than the smallest 
normalized number in the indicated format; 1 = underflow, 0 = no underflow. 


Underflow is never indicated when the correct unrounded result is zero. Otherwise: 


If UFM = 0, underflow occurs if a nonzero result is tiny and a loss of accuracy occurs. 





If UFM = 1, underflow occurs if a nonzero result is tiny. 


SPARC-V9 allows underflow to be detected either before or after rounding. The
UltraSPARC IIIi processor detects underflow before rounding.


FSR_division-by-zero (dzc, dza)


X ÷ 0.0, where X is subnormal or normalized; 1 = division by zero, 0 = no division by zero.


Note: 0.0 ÷ 0.0 does not set the dzc or dza bits.





FSR_inexact (nxc, nxa)


The rounded result of an operation differs from the infinitely precise unrounded result; 
1 = inexact result, 0 = exact result. 





Programming Note: Software must be capable of simulating the operation of the FPU
in order to properly handle the unimplemented_FPop, unfinished_FPop, and
IEEE_754_exception floating-point trap types. Thus, a user application program always sees
an FSR that is fully compliant with IEEE Std 754-1985.








6.10 ASI Mapped Registers


In this section, the Data Cache Unit Control Register and Data Watchpoint registers (virtual
address data watchpoint and physical address data watchpoint) are described.


6.10.1 Data Cache Unit Control Register (DCUCR)


ASI 45₁₆ (ASI_DCU_CONTROL_REGISTER), VA = 0₁₆

The DCUCR contains fields that control several memory-related hardware functions. The
functions include instruction, prefetch, write and data caches, MMUs, and watchpoint
setting.


After a Power-On Reset (POR), all fields of DCUCR are set to zero. After a WDR, XIR, or 
SIR, all fields of DCUCR defined in this section are set to zero. 


The DCUCR is illustrated in FIGURE 6-31 and described in TABLE 6-26. In the table, the field 
definitions and bits are grouped by function rather than by a strict bit sequence. 


Bits     Field

63:50    Reserved
49       CP (physically cacheable)
48       CV (virtually cacheable)
47       ME (merge enable)
46       RE (RAW bypass)
45       PE (prefetch enable)
44       HPE (hw prefetch)
43       SPE (sw prefetch)
42       SL (second load)
41       WE (write cache enable)
40:33    PM (PA data watchpoint mask)
32:25    VM (VA data watchpoint mask)
24       PR (PA watch, read)
23       PW (PA watch, write)
22       VR (VA watch, read)
21       VW (VA watch, write)
20:4     Reserved
3        DM (D-MMU enable)
2        IM (I-MMU enable)
1        DC (D-cache enable)
0        IC (I-cache enable)


FIGURE 6-31 DCU Control Register Access Data Format (ASI 45₁₆)


TABLE 6-26 DCUCR Bit Field Descriptions


Bits          Field      Type   Description (Note)

63:50, 20:4   Reserved   RW


MMU Control

49            CP                Cacheability of PA. CP determines the physical cacheability of
                                memory accesses when the I-MMU or D-MMU is disabled (IM = 0 or
                                DM = 0). The TTE.E (side-effect) bit is set to the complement of
                                CP when MMUs are enabled; 1 = cacheable, 0 = non-cacheable.
                                (Note 1)
48            CV         RW     Cacheability of VA. CV determines the virtual cacheability of
                                memory accesses when the D-MMU is disabled (DM = 0);
                                1 = cacheable, 0 = non-cacheable.
3             DM                D-MMU Enable. If DM = 0, the D-MMU is disabled (pass-through
                                mode). Note: When the MMU/TLB is disabled, a virtual address is
                                passed through as a physical address.
2             IM                I-MMU Enable. If IM = 0, the I-MMU is disabled (pass-through
                                mode).


Store Queue Control

47            ME                Non-cacheable Store Merging Enable. If cleared, no merging of
                                non-cacheable, non-side-effect store data will occur. Each
                                non-cacheable store will generate a system bus transaction.
46            RE                RAW Bypass Enable. If cleared, no bypassing of data from the store
                                queue to a dependent load instruction will occur. All load
                                instructions will have their RAW predict field cleared.


Prefetch Control (Note 2)

45            PE                Prefetch Cache Enable. If prefetch is disabled by clearing the PE
                                bit, all references to the P-cache are handled as P-cache misses.
                                If cleared, the P-cache does not generate any hardware prefetch
                                requests to the L2-cache. Software prefetch instructions are not
                                affected by this bit.
44            HPE               Prefetch Cache Hardware Prefetch Enable. (Note 3)
43            SPE               Software Prefetch Enable. Clear to disable prefetch instructions.
                                When disabled, software prefetch instructions do not generate a
                                request to the L2-cache or the system interface. They will
                                continue to be issued to the pipeline, where they will be treated
                                as NOPs.


Second Load Control

42            SL                Second Load Steering Enable. If cleared, all load type
                                instructions will be steered to the MS pipeline and no
                                floating-point load type instructions will be issued to the A0 or
                                A1 pipelines.


I-cache, D-cache, and W-cache Control

41            WE                Write Cache Enable. If zero, all W-cache references will be
                                handled as W-cache misses. Each store queue entry will perform a
                                RMW transaction to the L2-cache, and the W-cache will be
                                maintained in a clean state. Software is required to flush the
                                W-cache (force it to a clean state) before setting this bit to
                                zero.
1             DC                Data Cache Enable. DC is used to enable/disable the operation of
                                the data cache closest to the processor (D-cache); DC = 1 enables
                                the D-cache and DC = 0 disables it. When DC = 0, memory accesses
                                (loads, stores, atomic load-stores) are satisfied by caches lower
                                in the cache hierarchy.
                                When the D-cache is disabled, its contents are not updated. When
                                the D-cache is reenabled, any D-cache lines still marked as
                                “valid” may be inconsistent with the state of memory or other
                                caches. In that case, software must handle any inconsistencies by
                                flushing the inconsistent lines from the D-cache.
0             IC                Instruction Cache Enable. IC is used to enable/disable the
                                operation of the instruction cache closest to the processor
                                (I-cache); IC = 1 enables the I-cache and IC = 0 disables it. When
                                IC = 0, instruction fetches are satisfied by caches lower in the
                                cache hierarchy.
                                When the I-cache is disabled, its contents are not updated. When
                                the I-cache is reenabled, any I-cache lines still marked as
                                “valid” may be inconsistent with the state of memory or other
                                caches. In that case, software must handle any inconsistencies by
                                invalidating the inconsistent lines in the I-cache.


Watchpoint Control

40:33         PM<7:0>           DCU Physical Address Data Watchpoint Mask. The Physical Address
                                Data Watchpoint Register contains the physical address of a 64-bit
                                word to be watched. The 8-bit Physical Address Data Watchpoint
                                Mask controls which byte(s) within the 64-bit word should be
                                watched. If all eight bits are cleared, the physical watchpoint is
                                disabled. If the watchpoint is enabled and a data reference
                                overlaps any of the watched bytes in the watchpoint mask, then a
                                physical watchpoint trap is generated. Watchpoint behavior for a
                                Partial Store instruction may differ. Please see the VM field
                                description in the table.
32:25         VM<7:0>           DCU Virtual Address Data Watchpoint Mask. The Virtual Address Data
                                Watchpoint Register contains the virtual address of a 64-bit word
                                to be watched. This 8-bit mask controls which byte(s) within the
                                64-bit word should be watched. If all eight bits are cleared, then
                                the virtual watchpoint is disabled. If the watchpoint is enabled
                                and a data reference overlaps any of the watched bytes in the
                                watchpoint mask, then a virtual watchpoint trap is generated.
                                (Note 4)
                                VA/PA data watchpoint byte mask examples are shown below.

                                Watchpoint Mask    Least Significant 3 Bits of
                                (PM and VM)        Address of Bytes Watched
                                                   7654 3210
                                00₁₆               Watchpoint disabled
                                01₁₆               0000 0001
                                32₁₆               0011 0010
                                FF₁₆               1111 1111

24, 23        PR, PW            DCU Physical Address Data Watchpoint Enable. If PR (PW) is one,
                                then a data read (write) that matches the range of addresses in
                                the Physical Watchpoint Register causes a watchpoint trap. If both
                                PR and PW are set, a watchpoint trap will occur on either a read
                                or write access.
22, 21        VR, VW            DCU Virtual Address Data Watchpoint Enable. If VR (VW) is one,
                                then a data read (write) that matches the range of addresses in
                                the Virtual Watchpoint Register causes a watchpoint trap. If both
                                VR and VW are set, a watchpoint trap will occur on either a read
                                or write access.


Notes:

1. The CP and CV bits of DCUCR must be changed with care. It is recommended that a MEMBAR #Sync be executed before and after
CP or CV is changed. Also, software must manage cache states to be consistent before and after CP or CV is changed.

2. Prefetch is enabled in the UltraSPARC IIIi processor. Both hardware prefetch and software prefetch for data to the P-cache are valid only
for floating-point load instructions and are not valid for integer load instructions.

3. Hardware prefetch and the second load unit must not both be enabled at the same time. Enabling both may cause incorrect program behavior.

4. Watchpoint exceptions on Partial Store instructions occur conservatively. The DCUCR.VM masks are only checked for a nonzero value
(a zero mask means the watchpoint is disabled). The byte store mask (r[rs2]) in the Partial Store instruction is ignored, and a watchpoint
exception can occur even if the mask is zero (that is, no store will take place).
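When reasoning about DCUCR values (for example, in a simulator or a debugging script), it can help to have the bit positions from TABLE 6-26 in one place. The following is a minimal sketch with hypothetical helper names; it only manipulates an integer image of the register and does not perform the privileged ASI access needed to read or write the real DCUCR.

```python
# (high bit, low bit) for each DCUCR field, taken from TABLE 6-26
DCUCR_FIELDS = {
    "CP": (49, 49), "CV": (48, 48), "ME": (47, 47), "RE": (46, 46),
    "PE": (45, 45), "HPE": (44, 44), "SPE": (43, 43), "SL": (42, 42),
    "WE": (41, 41), "PM": (40, 33), "VM": (32, 25),
    "PR": (24, 24), "PW": (23, 23), "VR": (22, 22), "VW": (21, 21),
    "DM": (3, 3), "IM": (2, 2), "DC": (1, 1), "IC": (0, 0),
}


def get_field(dcucr: int, name: str) -> int:
    """Return the value of the named field from a DCUCR image."""
    high, low = DCUCR_FIELDS[name]
    return (dcucr >> low) & ((1 << (high - low + 1)) - 1)


def set_field(dcucr: int, name: str, value: int) -> int:
    """Return a copy of the DCUCR image with the named field replaced."""
    high, low = DCUCR_FIELDS[name]
    mask = ((1 << (high - low + 1)) - 1) << low
    return (dcucr & ~mask) | ((value << low) & mask)


def hpe_and_sl_both_enabled(dcucr: int) -> bool:
    """Detect the combination that Note 3 says must be avoided."""
    return bool(get_field(dcucr, "HPE") and get_field(dcucr, "SL"))
```

A caller might use hpe_and_sl_both_enabled as a sanity check before committing a new value to the register.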
6.10.2 Data Watchpoint Registers


The UltraSPARC IIIi processor implements “break before” watchpoint traps. When the
address of a data access matches a preset physical or virtual watchpoint address, instruction
execution is stopped immediately before the watched memory location is accessed.

TABLE 6-27 lists ASIs that are affected by the two watchpoint traps. 


TABLE 6-27 ASIs Affected by Watchpoint Traps


ASI Type               ASI Range                             D-MMU   Watchpoint If   Watchpoint If
                                                                     Matching VA     Matching PA

Translating ASIs       04₁₆-11₁₆, 18₁₆-19₁₆, 24₁₆-2C₁₆,      On      Y               Y
                       70₁₆-71₁₆, 78₁₆-79₁₆, 80₁₆-           Off     N               Y
Bypass ASIs            14₁₆-15₁₆, 1C₁₆-1D₁₆                  -       N               Y
Non-translating ASIs   30₁₆-6F₁₆, 72₁₆-77₁₆, 7A₁₆-7F₁₆       -       N               N














For 128-bit (quad) atomic load and 64-byte block load and store instructions, a watchpoint 
trap is generated only if the watchpoint overlaps the lowest-address eight bytes of the access. 


To avoid trapping infinitely, software should emulate the instruction that caused the trap and 
return from the trap by using a DONE instruction or turn off the watchpoint before returning 
from a watchpoint trap handler. 





Two 64-bit data watchpoint registers provide the means to monitor data accesses during 
program execution. When Virtual/Physical Data Watchpoint is enabled, the virtual/physical 
addresses of all data references are compared against the content of the corresponding 
watchpoint register. If a match occurs, a VA_watchpoint or PA_watchpoint trap is signalled 
before the data reference instruction is completed. The virtual address watchpoint trap has 
higher priority than the physical address watchpoint trap. 


Separate 8-bit byte masks allow watchpoints to be set for a range of addresses. Each zero bit 
in the byte mask causes the comparison to ignore the corresponding byte in the address. 
These watchpoint byte masks and the watchpoint enable bits reside in the DCUCR. 
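The comparison can be modeled as follows. This sketch rests on two assumptions that go beyond the text: that mask bit n corresponds to the byte whose low three address bits equal n (as the mask examples in TABLE 6-26 suggest), and that the access does not cross an 8-byte boundary. The function name is hypothetical, and the special cases for quad, block, and Partial Store accesses are not modeled.

```python
def watchpoint_hit(watch_addr: int, byte_mask: int, access_addr: int, size: int) -> bool:
    """Return True if an access of `size` bytes overlaps a watched byte."""
    if byte_mask & 0xFF == 0:
        return False  # all mask bits clear: the watchpoint is disabled
    if (watch_addr >> 3) != (access_addr >> 3):
        return False  # the access falls in a different 64-bit word
    offset = access_addr & 0x7
    accessed_bytes = (((1 << size) - 1) << offset) & 0xFF  # one bit per byte touched
    return bool(byte_mask & accessed_bytes)
```

For example, with a mask of 32₁₆ a one-byte access at offset 1 hits, while a one-byte access at offset 0 does not.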


Virtual Address Data Watchpoint Register
ASI 58₁₆, VA = 38₁₆
Name: VA Data Watchpoint Register


FIGURE 6-32 illustrates the Virtual Address Watchpoint Register. DB_VA is the most
significant 61 bits of the 64-bit virtual data watchpoint address.

DB_VA      | -
63       3 | 2  0


FIGURE 6-32 VA Data Watchpoint Register Format


Physical Address Data Watchpoint Register
ASI 58₁₆, VA = 40₁₆
Name: PA Data Watchpoint Register


FIGURE 6-33 illustrates the PA Data Watchpoint Register. DB_PA is the most significant 61
bits of the physical data watchpoint address. The width of an UltraSPARC IIIi processor
physical address is 43 bits.


FIGURE 6-33 PA Data Watchpoint Register Format





Compatibility Note: The UltraSPARC IIIi processor supports a 43-bit physical address
space. Software is responsible for writing a zero-extended 64-bit address into the PA Data
Watchpoint register.
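A hedged sketch of how software might form the values to write is shown below. It assumes that the low three bits of each register are not part of the watched address and can be written as zero, which the figures suggest but this section does not state outright. The helper names are hypothetical.

```python
PA_BITS = 43                              # physical address width of this processor
DOUBLEWORD_MASK = 0xFFFF_FFFF_FFFF_FFF8   # keeps bits 63:3 (DB_VA or DB_PA)


def va_watchpoint_value(va: int) -> int:
    """Form the VA Data Watchpoint Register value for a 64-bit virtual address."""
    return va & DOUBLEWORD_MASK


def pa_watchpoint_value(pa: int) -> int:
    """Form the zero-extended PA Data Watchpoint Register value."""
    if pa < 0 or pa >> PA_BITS:
        raise ValueError("physical address does not fit in 43 bits")
    return pa & DOUBLEWORD_MASK
```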





Data Watchpoint Reliability 


The processor supports watchpoint comparison on the MS (memory) pipeline; any 
second-issue (Ax pipeline) floating-point loads will not trigger a watchpoint. For reliable use 
of the watchpoint mechanism, the second floating-point load feature must be disabled using 
DCUCR.SL. 
CHAPTER 7





Instruction Types 





Instructions are accessed by the processor from memory and are executed, annulled, or 
trapped. Instructions are discussed in seven general categories. The processor instructions are 
described in the following sections: 


Learning the Instructions 
Section 7.1, “Introduction” 
Section 7.2, “Memory Addressing for Load and Store Instructions” 
Section 7.3, “Integer Execution Environment” 
Section 7.4, “Floating-Point Execution Environment” 
Section 7.5, “VIS Execution Environment” 
Section 7.6, “Data Coherency Instructions” 
Section 7.7, “Register Window Management Instructions” 
Section 7.8, “Program Control Transfer Instructions” 


Section 7.9, “Prefetch Instructions” 


Reference Sections 
Section 7.10, “Instruction Summary Table by Category” 
Section 7.10.5, “Integer Execution Environment Instructions” 
Section 7.10.6, “Floating-Point Execution Environment Instructions” 
Section 7.10.7, “VIS Execution Environment Instructions” 
Section 7.10.8, “Data Coherency Instructions” 
Section 7.10.9, “Register-window Management Instructions” 
Section 7.10.10, “Program Control Transfer Instructions” 


Section 7.10.11, “Data Prefetch Instructions” 
Section 7.11, “Instruction Formats and Fields”
Section 7.12, “Reserved Opcodes and Instruction Fields” 
Section 7.13, “Big/Little-Endian Addressing” 





7.1 Introduction


The processor’s RISC architecture is defined primarily by the SPARC-V9 architecture. The
UltraSPARC II processors were the first to extend the SPARC-V9 architecture with new
instructions and additional logic units. The UltraSPARC IIIi processor further extends this
instruction execution environment.


The UltraSPARC IIIi processor provides backward compatibility for SPARC application
programs. Upgraded system software is required. Noteworthy enhancements to the processor
include greater capability in the execution units to improve instruction scheduling, new VIS
instructions to reduce the length of code sequences, and data prefetch instructions to provide
the compiler with ways to improve cache hit rates.


Our compiler and other software development tools take advantage of the new instruction 
features to increase parallel execution, reduce code size, and achieve shorter instruction 
execution latencies. 





7.2 Memory Addressing for Load and Store Instructions


SPARC-V9 uses big-endian byte order by default; the address of a quadword, doubleword, 
word, or halfword is the address of its most significant byte. Increasing the address means 
decreasing the significance of the unit being accessed. All instruction accesses are performed 
using big-endian byte order. SPARC-V9 also can support little-endian byte order for data 
accesses only; the address of a quadword, doubleword, word, or halfword is the address of its 
least significant byte. Increasing the address means increasing the significance of the unit 
being accessed. 
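The two conventions can be illustrated with a short model of a word load from a byte-addressed memory. This is a sketch for explanation only; the function name is hypothetical and the memory is simply a Python bytes object.

```python
def load_word(memory: bytes, address: int, little_endian: bool = False) -> int:
    """Load a 32-bit word; big-endian unless little_endian is requested."""
    raw = memory[address:address + 4]
    # Big-endian: the byte at the lowest address is the most significant.
    # Little-endian: the byte at the lowest address is the least significant.
    return int.from_bytes(raw, "little" if little_endian else "big")
```

Given the bytes 12₁₆ 34₁₆ 56₁₆ 78₁₆ at increasing addresses, the big-endian load yields 12345678₁₆ and the little-endian load yields 78563412₁₆.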
Integer Unit Memory Alignment Requirements


Halfword accesses are aligned on 2-byte boundaries; word accesses (which include 
instruction fetches) are aligned on 4-byte boundaries; extended-word and doubleword 
accesses are aligned on 8-byte boundaries. An improperly aligned address in a load, store, or 
load-store instruction causes a trap to occur, with possible exceptions. 





Programming Note: By setting i = 1 and rs1 = 0, you can access any location in the
lowest or highest 4 KB of an address space without using a register to hold part of the
address.
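The reason this works is that, when i = 1, the effective address is the sum of r[rs1] and a sign-extended 13-bit immediate (simm13 in SPARC-V9), and r[0] always reads as zero. A small sketch of the arithmetic, with hypothetical function names:

```python
ADDRESS_MASK = 0xFFFF_FFFF_FFFF_FFFF  # 64-bit address space


def sign_extend_simm13(simm13: int) -> int:
    """Interpret a 13-bit field as a signed value in the range -4096..4095."""
    simm13 &= 0x1FFF
    return simm13 - 0x2000 if simm13 & 0x1000 else simm13


def effective_address(rs1_value: int, simm13: int) -> int:
    """Effective address for a load/store with i = 1."""
    return (rs1_value + sign_extend_simm13(simm13)) & ADDRESS_MASK
```

With rs1_value = 0, non-negative immediates reach the lowest 4 KB and negative immediates wrap around to the highest 4 KB.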





FP/VIS Memory Alignment Requirements 


Extended word and doubleword (64-bit) accesses must be aligned on 8-byte boundaries, 
quadword accesses must be aligned on 16-byte boundaries, and block load (BLD) and block 
store (BST) accesses must be aligned on 64-byte boundaries. 


All references are 32, 64, or 128 bits. They must be naturally aligned to their data width in 
memory except for double-precision floating-point (FP) values, which may be aligned on 
word boundaries. However, if so aligned, doubleword loads/stores may not be used to access 
them, resulting in less efficient and nonatomic accesses. 


An improperly aligned address in a load, store, or load-store instruction causes a 
mem_address_not_aligned exception to occur, with the following exceptions: 


An LDDF or LDDFA instruction accessing an address that is word aligned but not 
doubleword aligned causes an LDDF_mem_address_not_aligned exception. 








An STDF or STDFA instruction accessing an address that is word aligned but not
doubleword aligned causes an STDF_mem_address_not_aligned exception.
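The alignment rules above amount to a simple classification. The sketch below is illustrative: the function name is hypothetical, opcodes are passed as strings, and `size` is the natural alignment of the access in bytes (for example, 8 for a doubleword and 64 for a block load or store).

```python
from typing import Optional


def alignment_exception(opcode: str, address: int, size: int) -> Optional[str]:
    """Return the exception name for a misaligned access, or None if aligned."""
    if address % size == 0:
        return None
    word_aligned = address % 4 == 0
    if opcode in ("LDDF", "LDDFA") and word_aligned:
        return "LDDF_mem_address_not_aligned"
    if opcode in ("STDF", "STDFA") and word_aligned:
        return "STDF_mem_address_not_aligned"
    return "mem_address_not_aligned"
```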





Byte Order Addressing Conventions (Endianness)


The processor uses big-endian byte order for all instruction accesses and, by default, for data 
accesses. It is possible to access data in little-endian format by using load and store alternate 
instructions that support little-endian data structures. It is also possible to change the default 
byte order for implicit data accesses. 


See Section 7.13, “Big/Little-Endian Addressing” for details.


7.2.4 Address Space Identifiers (ASIs)


Versions of the load/store instructions, the load and store alternate instructions, can specify an
8-bit address space identifier (ASI) to go along with the load/store data instruction.
The load and store alternate instructions have the following three sources of ASIs:

Explicit immediate field of the instruction
ASI register reference
Hardcoded in the instruction

Supervisor software (privileged mode) uses ASIs to access special, protected registers, such
as MMU, cache control, and processor state registers, and other processor- or system-dependent
values.


ASIs are also used to modify the function of many instructions. This overloading of
load/store instructions provides partial store, block load/store, and atomic memory access
operations.


Implicit ASI Value 


Load and store instructions provide an implicit ASI value of ASI_PRIMARY,
ASI_PRIMARY_LITTLE, ASI_NUCLEUS, or ASI_NUCLEUS_LITTLE. Load and store
alternate instructions provide an explicit ASI, specified by the imm_asi instruction field
when i = 0, or the contents of the ASI register when i = 1.




















Privileged and Non-Privileged ASIs 


ASIs 00₁₆ through 7F₁₆ are restricted; only privileged software is allowed to access them. An
attempt to access a restricted ASI by non-privileged software results in a privileged_action
exception. ASIs 80₁₆ through FF₁₆ are unrestricted; software is allowed to access them
whether the processor is operating in privileged or non-privileged mode.
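
The ASI selection and privilege rules described above can be modeled in a few lines. The following Python sketch is only an illustration of the text, not processor or operating system code; the function names `select_asi` and `check_access` are hypothetical.

```python
def select_asi(i_bit, imm_asi, asi_register):
    """Pick the ASI used by a load/store alternate instruction.

    i = 0 selects the imm_asi instruction field; i = 1 selects the
    contents of the ASI register. All eight bits are significant.
    """
    asi = imm_asi if i_bit == 0 else asi_register
    return asi & 0xFF


def is_restricted(asi):
    """ASIs 0x00 through 0x7F are restricted to privileged software."""
    return 0x00 <= asi <= 0x7F


def check_access(asi, privileged):
    """Return the outcome of an access: 'ok' or the exception name."""
    if is_restricted(asi) and not privileged:
        return "privileged_action"
    return "ok"
```

For instance, `check_access(0x04, privileged=False)` reports `privileged_action` in this model, while the same ASI is accepted when `privileged=True`.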


Compatibility Note — The SPARC-V9 architecture provides the basic framework and 
defines the required ASIs for the processor. Other ASIs are defined (and sometimes re- 
defined) for a specific processor or family of processors as allowed by the SPARC-V9 
architecture. 





Implementation Note — The processor decodes all eight bits of each ASI specifier. In 
addition, the processors redefine certain ASIs as appropriate for a specific processor. 


Maintaining Data Coherency


The processor's memory architecture requires some software intervention to provide data
coherency during program execution. These requirements are discussed in Chapter 8,
“Memory Models,” using the FLUSH and MEMBAR instructions described in Section 7.6,
“Data Coherency Instructions.”





The two types of data coherency instructions are needed to flush the cache for self-modifying 
code and to write data buffers out to memory. 





Integer Execution Environment


IU Data Access Instructions


Load, store, and atomic instructions are the only instructions that access memory. All the IU
data access instructions except compare and swap (CASA and CASXA) use either two r registers
or one r register and simm13, a signed 13-bit immediate value, to calculate a 64-bit,
byte-aligned memory address. Compare and swap uses a single r register to specify a 64-bit
memory address. Floating-point register load and store instructions are discussed in
Section 7.4.2, “FPU/VIS Data Access Instructions.”


The processor appends an ASI to the 64-bit address used with all the data access instructions.


Note — In addition to the large physical main memory, the processor has many memory
mapped control, status, and diagnostic registers that are accessed using load and store
instructions with an appropriate ASI value.


The destination field of the data access instruction specifies an r or f (single, double/
extended, or quadword) register that supplies the data for a store or that receives the data
from a load.


7.3.1.1 Load and Store Instructions


Integer load and store instructions support byte, halfword (16-bit), word (32-bit), and 
doubleword (64-bit) accesses. Some versions of integer load instructions perform sign 
extension on 8-, 16-, and 32-bit values as they are loaded into a 64-bit destination register. 
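
The difference between the sign-extending and zero-extending load variants (for example LDSB versus LDUB in SPARC-V9) can be illustrated with a small model. This is a minimal Python sketch of the arithmetic only; the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def zero_extend(value, bits):
    """Model an unsigned load: high-order bits of the register are zero."""
    return value & ((1 << bits) - 1)


def sign_extend(value, bits):
    """Model a signed load: copy the sign bit into all higher register bits."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    # XOR-and-subtract propagates the sign bit; mask to a 64-bit register.
    return ((value ^ sign) - sign) & MASK64
```

Loading the byte 0x80 gives 0x80 with `zero_extend` and 0xFFFFFFFFFFFFFF80 with `sign_extend` in this model.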


7.3.1.2 Move Instruction


There is no explicit integer move instruction. A move can be synthesized by adding,
subtracting, or OR-ing a zero with a register and directing the result to another
register. The zero can come from a register input (such as %g0, which always reads as zero in
SPARC-V9) or from an immediate input to the instruction.


7.3.1.3 Conditional Move Instructions


Based on Integer (icc/xcc) and Floating-Point (fcc) Condition Codes 


This subsection describes two instructions that copy the contents of one register to another 
register within the same register file: one instruction for moving within the integer register 
file and another for moving within the floating-point register file. 


- MOVcc Instruction

If a specified icc/xcc or fcc condition is satisfied, then the MOVcc instruction copies the
contents of any integer register to a destination integer register.

- FMOVcc Instruction

If a specified icc/xcc or fcc condition is satisfied, then the FMOVcc instruction copies the
contents of any floating-point register to a destination floating-point register.


(A similar set of conditional move instructions is based on an integer register value. These
conditional move instructions are described in Section 7.4, “Floating-Point Execution
Environment.”)


The condition code to test is specified in the instruction and may be any of the conditions
allowed in conditional delayed control transfer instructions. This condition is tested against
one of the six sets of condition codes (icc, xcc, fcc0, fcc1, fcc2, and fcc3), as specified
by the instruction.


For example: 


fmovdg %fcc2, %f20, %f22


moves the contents of the double-precision floating-point register %f20 to register %f22 if
floating-point condition code number 2 (fcc2) indicates a greater-than relation
(FSR.fcc2 = 2). If fcc2 does not indicate a greater-than relation (FSR.fcc2 ≠ 2), then
the move is not performed.


The MOVcc and FMOVcc instructions can be used to eliminate some branches in programs. 
In most situations, branches will take more clock cycles than the MOVcc or FMOVcc 
instructions. 


For example, the following C statement:

if (A > B) X = 1; else X = 0;

can be coded as


cmp   %i0, %i2        ! (A > B)
or    %g0, 0, %i3     ! set X = 0
movg  %xcc, 1, %i3    ! overwrite X with 1 if A > B


which eliminates the need for a branch. 


Based on Integer Register Value 


There are separate versions for the IU and floating-point unit (FPU) register files: 


- MOVr Instruction

If the contents of an integer register satisfy a specified condition, then the MOVr instruction
copies the contents of any integer register to a destination integer register.

- FMOVr Instruction

If the contents of an integer register satisfy a specified condition, then the FMOVr instruction
copies the contents of any floating-point register to a destination floating-point register.


The conditions to test are enumerated in TABLE 7-1. 


TABLE 7-1 MOVr and FMOVr Test Conditions 




















Any of the integer registers may be tested for one of the conditions, and the result used to 
control the move. For example, 


movrnz %i2, %l4, %l6


moves integer register %l4 to integer register %l6 if integer register %i2 contains a nonzero
value.
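
The register-value tests used by MOVr and FMOVr can be sketched as follows. The six conditions listed here (zero, less than or equal to zero, less than zero, nonzero, greater than zero, greater than or equal to zero) are taken from the SPARC-V9 definition of these instructions and are treated as an assumption here; the Python names are hypothetical and the sketch models only the selection logic.

```python
def to_signed64(x):
    """Interpret a 64-bit register value as a two's-complement integer."""
    x &= (1 << 64) - 1
    return x - (1 << 64) if x >> 63 else x


# Suffix of the mnemonic (movrz, movrlez, ...) mapped to its test.
RCOND = {
    "z": lambda v: v == 0,
    "lez": lambda v: v <= 0,
    "lz": lambda v: v < 0,
    "nz": lambda v: v != 0,
    "gz": lambda v: v > 0,
    "gez": lambda v: v >= 0,
}


def movr(cond, tested, source, dest):
    """Return the new destination value: source if the test passes,
    otherwise the unchanged destination."""
    return source if RCOND[cond](to_signed64(tested)) else dest
```

In this model, `movr("nz", tested, source, dest)` mirrors the movrnz example: the destination takes the source value only when the tested register is nonzero.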


MOVr and FMOVr can be used to eliminate some branches in programs or to emulate 
multiple unsigned condition codes by using an integer register to hold the result of a 
comparison. 


Atomic Instructions


CASA/CASXA, SWAP, and LDSTUB are special atomic memory access instructions that
concurrent processes use for synchronization and memory updates.


The SWAP and LDSTUB instructions can optionally access alternate space. (The CASA
instruction always accesses alternate memory spaces.) If the ASI specified for any alternate
form of these instructions is a privileged ASI (value less than 80₁₆), then the processor must
be in privileged mode to access it.


Atomic Quad Load Instruction (LDDA with ASI xx) 


The atomic quad load instruction supplies an indivisible quadword (16-byte) load that is 
important in system software programs. 


Compare and Swap Atomic Instruction (CASA) 


An r register specifies the value that is compared with the value in memory at the computed 
address. CASA accesses words, and CASXA accesses doublewords. 


If the values are equal (memory location and r register), then the destination field specifies 
the r register that is to be exchanged atomically with the addressed memory location. 


If the values are unequal, then the destination field specifies the r register that was to receive 
the value at the addressed memory location; in this case, the addressed memory location 
remains unchanged. 
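
The compare-and-swap behavior described above can be sketched as a small model. This Python code is illustrative only: a dictionary stands in for memory, nothing here is actually atomic, and the function name `casa` is simply borrowed from the mnemonic.

```python
def casa(memory, addr, compare_value, rd_value, width_bits=32):
    """Model CASA (width_bits=32) or CASXA (width_bits=64).

    Returns the value that ends up in the destination register, which is
    always the old memory contents. Memory is updated with rd_value only
    when the old contents equal compare_value.
    """
    mask = (1 << width_bits) - 1
    old = memory[addr] & mask
    if old == (compare_value & mask):
        memory[addr] = rd_value & mask
    return old
```

A caller can tell whether the swap happened by comparing the returned value with `compare_value`, which is the usual basis for a retry loop.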


Swap Atomic Instruction (SWAP)


The destination register identifies the r register to be exchanged atomically with the 
calculated memory location. SWAP accesses words. 


Load-Store Unsigned Byte (LDSTUB) 


The LDSTUB instruction reads a byte from memory and writes ones to the location read. 
LDSTUB accesses bytes. 


IU Arithmetic Instructions


The integer arithmetic instructions are generally triadic-register-address instructions that
compute a result that is a function of two source operands. They either write the result into
the destination register r[rd] or discard it. One of the source operands is always r[rs1]. The
other source operand depends on the i bit in the instruction. If i = 0, then the operand is
r[rs2]. If i = 1, then the operand is the immediate constant simm10, simm11, or
simm13 sign-extended to 64 bits.


The arithmetic/logical/shift instructions perform arithmetic, tagged arithmetic, logical, and 
shift operations. One exception is the SETHI instruction that can be used in combination with 
another arithmetic or logical instruction to create a 32-bit constant in an r register. 





Condition Codes 


Most integer arithmetic instructions have two versions: one sets the integer condition codes 
(icc and xcc) as a side-effect; the other does not affect the condition codes. 


Integer Add and Subtract Instructions 


Sixty-four bit arithmetic is performed on two r registers to generate a 64-bit result. The icc 
and xcc condition codes can be optionally set. 
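
To make the relationship between the 64-bit result and the two sets of condition codes concrete, here is a minimal Python sketch of an add that also sets the condition codes. It assumes the usual N, Z, V, and C flag definitions, with icc derived from the low 32 bits and xcc from all 64 bits; the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def condition_codes(a, b, result, bits):
    """Return the N, Z, V, C flags of a + b for the low `bits` bits."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    x, y, r = a & mask, b & mask, result & mask
    return {
        "N": int(bool(r & sign)),                          # result negative
        "Z": int(r == 0),                                  # result zero
        "V": int(bool(~(x ^ y) & (x ^ r) & sign)),         # signed overflow
        "C": int((x + y) > mask),                          # carry out
    }


def addcc(a, b):
    """Return (result, icc, xcc) for a 64-bit add that sets condition codes."""
    a &= MASK64
    b &= MASK64
    result = (a + b) & MASK64
    return result, condition_codes(a, b, result, 32), condition_codes(a, b, result, 64)
```

Adding 1 to 0x7FFFFFFF shows why both sets exist: the model reports a 32-bit overflow in icc while xcc shows no overflow.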


Tagged Integer Add and Subtract Instructions 


The tagged arithmetic instructions assume that the least-significant two bits of each operand 
are a data-type tag. These instructions set the integer condition code (icc) and extended 
integer condition code (xcc) overflow bits on 32-bit (icc) or 64-bit (xcc) arithmetic 
overflow. 


The tagged instructions are described in Appendix A "Instruction Definitions." 


If either of the two operands has a nonzero tag or if 32-bit arithmetic overflow occurs, tag
overflow is detected. If tag overflow occurs, then TADDcc and TSUBcc set the CCR.icc.V
bit; if 64-bit arithmetic overflow occurs, then they set the CCR.xcc.V bit.


The xcc overflow bit is not affected by the tag bits. 
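
The tag overflow rule can be sketched in a few lines. This Python model covers only the overflow bits of a tagged add as described in the text; it is an illustration, and the function name `taddcc` is borrowed from the mnemonic.

```python
MASK64 = (1 << 64) - 1


def signed_overflow(a, b, result, bits):
    """True if a + b overflows as a signed `bits`-bit addition."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    x, y, r = a & mask, b & mask, result & mask
    return bool(~(x ^ y) & (x ^ r) & sign)


def taddcc(a, b):
    """Return (result, icc_V, xcc_V) for a tagged add.

    icc.V is set on tag overflow: a nonzero tag (low two bits) in either
    operand, or 32-bit arithmetic overflow. xcc.V reflects only 64-bit
    arithmetic overflow and ignores the tag bits.
    """
    a &= MASK64
    b &= MASK64
    result = (a + b) & MASK64
    tag_nonzero = bool((a | b) & 0x3)
    icc_v = tag_nonzero or signed_overflow(a, b, result, 32)
    xcc_v = signed_overflow(a, b, result, 64)
    return result, int(icc_v), int(xcc_v)
```

With operands 4 and 8 (both tags zero) no overflow bit is set; with 5 and 8 the nonzero tag alone sets icc.V in this model.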





The trapping versions (TADDccTV, TSUBccTV) are deprecated. See Section A.70.16,
“Tagged Add and Trap on Overflow,” and Section A.70.17, “Tagged Subtract and Trap on
Overflow,” for details.


Integer Multiply and Divide Instructions


The integer multiply instruction performs a 64 × 64 → 64-bit operation; the integer divide
instructions perform 64 ÷ 64 → 64-bit operations. For compatibility with SPARC-V8,
32 × 32 → 64-bit multiply instructions, 64 ÷ 32 → 32-bit divide instructions, and the
multiply step instruction are provided. Division by zero causes a division_by_zero exception.


Some versions of the 32-bit multiply and divide instructions set the condition codes. 


Set High 22 Bits of Low Word 





The “set high 22 bits of low word of an r register” instruction (SETHI) writes a 22-bit 
constant from the instruction into bits 31 through 10 of the destination register. It clears the 
low-order 10 bits and high-order 32 bits, and it does not affect the condition codes. It is 
primarily used to construct constants in registers. 
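
The way SETHI pairs with a following logical instruction to build a 32-bit constant can be sketched as follows. This is a minimal Python model of the bit manipulation only; the function names are hypothetical, and the split at bit 10 corresponds to what SPARC assemblers expose as the %hi and %lo operators.

```python
def sethi(imm22):
    """Place a 22-bit constant in bits 31..10; bits 9..0 and 63..32 are zero."""
    return (imm22 & 0x3FFFFF) << 10


def load_const32(value):
    """Build a 32-bit constant the way a SETHI followed by an OR would."""
    value &= 0xFFFFFFFF
    high = sethi(value >> 10)      # upper 22 bits, as in %hi(value)
    low = value & 0x3FF            # lower 10 bits, as in %lo(value)
    return high | low
```

For example, `load_const32(0xDEADBEEF)` reassembles the original constant from its 22-bit and 10-bit parts.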


Integer Shift Instructions 


Shift logical instructions (SLL, SRL) shift an r register left or right by an immediate 
constant in the instruction or by the amount pre-loaded in an r register. 


IU Logic Instructions 


AND, ANDN, OR, ORN, XOR, XNOR Instructions


These are standard logic operations that work on all 64 bits of the register. The instructions 
can optionally set the integer condition codes (icc/xcc). 


IU Compare Instructions 


A special comparison instruction for integer values is not needed since it is easily 
synthesized with the “subtract and set condition codes” (SUBcc) instruction. 


IU Miscellaneous Instructions


Interval Arithmetic Mode Instruction (SIAM) (VIS II) 


The Set Interval Arithmetic Mode (SIAM) instruction sets the interval arithmetic mode fields 
in the graphics status register (GSR). 


Align Address Instruction 


The ALIGNADDR instruction takes two r registers and adds them together. The three least 
significant bits are forced to zero. 





The ALIGNADDRL instruction supports little-endian data structures by adding the two
r registers together and storing the two's complement of the three least
significant bits of the result in the 3-bit GSR.ALIGN field.
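
The two align address variants can be compared with a small model. This Python sketch assumes, based on the VIS definition of these instructions, that ALIGNADDR stores the low three bits of the sum in GSR.ALIGN and that both variants write the sum with its low three bits cleared to the destination register; the function name is hypothetical.

```python
MASK64 = (1 << 64) - 1


def alignaddr(rs1, rs2, little=False):
    """Return (rd, gsr_align) for ALIGNADDR, or ALIGNADDRL when little=True."""
    total = (rs1 + rs2) & MASK64
    rd = total & ~0x7 & MASK64          # low three bits forced to zero
    low3 = total & 0x7
    # ALIGNADDRL stores the two's complement of the low three bits.
    gsr_align = (-low3) & 0x7 if little else low3
    return rd, gsr_align
```

For an address sum ending in 5, the model yields an alignment of 5 for ALIGNADDR and 3 for ALIGNADDRL.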


Population of Ones Count 


A population count (POPC) opcode is defined but not implemented in hardware; instead,
executing it generates a trap.


Privileged Register Access Instructions 


The privileged register access instructions read and write another group of state and status
registers called privileged registers. These registers are visible only to privileged software.
The read privileged register instruction moves the privileged register contents into an
r register. The write privileged register instruction moves the contents of an r register into
the selected privileged register.


State Register Access Instructions 


The state register instructions access program-visible state and status registers. The read state 
register instruction moves the state register contents into an r register. The write state 
register instruction moves the contents of an r register into the selected state register. 


Some state registers can only be accessed in privileged mode, others in either privileged or 
non-privileged mode. Some registers have access bits to restrict their availability as desired 
by the privileged software. 


Floating-Point Execution Environment


The floating-point and VIS execution unit includes the floating-point register file for
floating-point and fixed-point data formats and the execution pipelines for floating-point and
VIS instructions.


This execution unit is a single unit that may be referred to by any one of the following names,
depending on the context:

- Floating-point Unit (FPU)
- Floating-point and Graphics Unit (FGU)
- VIS Execution Unit (VIS)
- FPU/VIS





Note — The instructions associated with the FPU/VIS execution unit are divided between 
floating-point and VIS execution environments, but otherwise use the same hardware 
pipelines. 


Floating-Point Operate Instructions 


Floating-point operate (FPop) instructions perform all floating-point calculations; they are 
register-to-register instructions that operate on the floating-point registers. Like arithmetic, 
logical, and shift instructions, FPops compute a result that is a function of one or two source 
operands. Specific floating-point operations are selected by a subfield of the FPop1/FPop2 
instruction formats. 


FPops are generally triadic-register-address instructions. They compute a result that is a
function of one or two source operands and place the result in one or more destination
f registers, with two exceptions:

- Floating-point convert operations, which use one source and one destination operand
- Floating-point compare operations, which do not write to an f register but update one of
  the fccn fields of the FSR instead


The term “FPop” refers to those instructions encoded by the FPop1 and FPop2 opcodes and
does not include branches based on the floating-point condition codes (FBfcc and
FBPfcc) or the load/store floating-point instructions.














If PSTATE.PEF = 0 or FPRS.FEF = 0, then any instruction, including an FPop instruction,
that attempts to access an FPU register generates an fp_disabled exception.

All FPop instructions clear the ftt field and set the cexc field unless they generate an
exception. Floating-point compare instructions also write one of the fccn fields. All FPop
instructions that can generate IEEE exceptions set the cexc and aexc fields unless they
generate an exception. FABS(s,d,q), FMOV(s,d,q), FMOVcc(s,d,q), FMOVr(s,d,q), and
FNEG(s,d,q) cannot generate IEEE exceptions; therefore, they clear cexc and leave aexc
unchanged.





Note — The processor may indicate that a floating-point instruction did not produce a 
correct IEEE Standard 754-1985 result by generating an fp_exception_other exception with
FSR.ftt = unfinished_FPop or unimplemented_FPop. In this case, privileged software must
emulate any functionality not present in the hardware.


The processor does not implement quad-precision floating-point operations in hardware.
Instead, these operations cause an fp_exception_other trap with
FSR.ftt = unimplemented_FPop, and the system software emulates quad operations.


FPU/VIS Data Access Instructions 


Floating-point load and store instructions support word, doubleword, and quadword memory 
accesses. 


There are no move instructions to move data directly between the integer and floating-point 
register files. 


Load Instructions 


Byte, halfword, word, and double/extended word data widths are supported with access to 
alternate address spaces. Data loaded into a register that is not 64 bits is filled with zeroes in 
the high-order bits. 


Store Instructions 


Byte, halfword, word, and double/extended word data widths are supported with access to 
alternate address spaces. 


Block Load and Store Instructions


Block load and store instructions access eight consecutive doublewords (64 bytes). The LDDFA
instruction is used with the various ASIs to specify a type of block transaction. The LDDFA
instruction is specified with ASIs 70, 71, 78, 79, F0, F1, F8, F9, E0, and E1 (hexadecimal) to
select between primary and secondary D-MMU contexts, little- and big-endian, privileged and
non-privileged, and a set of block commit store ASIs.


Conditional Move Instructions 


The FP/VIS conditional move instructions are described with the IU conditional move 
instructions, Section 7.3.1.3. 


Floating-Point Arithmetic Instructions 


Single-precision and double-precision FP is executed in hardware. Quad precision (128-bit) 
instructions are recognized by the processor and trapped so they can be emulated in software. 


Absolute Value and Negate Instructions 


These instructions modify the sign of the floating-point operand. 


Add and Subtract Instructions 


These instructions use standard IEEE operation. 


Multiply Instructions 


These instructions use standard IEEE operation with some exceptions. 


Square Root and Divide Instructions


The square root and divide instructions begin their execution in the FGM pipeline and block
new instructions from entering until the result is nearly ready to leave the pipeline and be
written to the register file.


Floating-Point Conversion Instructions


The following FP conversions are supported. Conversions do not generate fcc condition 
codes. 


Floating-Point to Integer 


All floating-point precision to word and double/extended word integer conversions are 
supported. 


Integer to Floating-Point 


Word and double/extended word integer to all floating-point precision number conversions 
are supported. 


Floating-Point to Floating-Point 


All floating-point precision to all floating-point precision number conversions are supported. 


Floating-Point Compare Instructions 


Operands of the same precision are compared, and the fcc condition codes are set accordingly.
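
The value written to an fcc field by a compare can be sketched as follows. The encoding used here (0 for equal, 1 for less than, 2 for greater than, 3 for unordered) is the SPARC-V9 encoding and agrees with the FSR.fcc2 = 2 greater-than example earlier in this chapter; the Python names are hypothetical.

```python
import math

FCC_EQUAL, FCC_LESS, FCC_GREATER, FCC_UNORDERED = 0, 1, 2, 3


def fcmp(a, b):
    """Return the 2-bit fcc value produced by comparing a with b."""
    if math.isnan(a) or math.isnan(b):
        return FCC_UNORDERED
    if a == b:
        return FCC_EQUAL
    return FCC_LESS if a < b else FCC_GREATER


def fmov_if_greater(fcc, source, dest):
    """Model a conditional move on 'greater': move only when fcc = 2."""
    return source if fcc == FCC_GREATER else dest
```

Any comparison involving a NaN yields the unordered value in this model, so a move conditioned on greater-than is not performed.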


Floating-Point Miscellaneous Instructions 


Load and Store FSR Register 


The FSR register is accessed by load and store instructions into and out of the floating-point 
register file. 


Data Alignment Instruction 


The data alignment instruction FALIGNDATA concatenates two registers (16 bytes) and 
stores a contiguous block of eight of these bytes starting at the offset stored in the 
GSR.ALIGN field. 
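
The byte extraction performed by FALIGNDATA can be modeled directly from the description above. This Python sketch treats the two source registers as 64-bit integers, with the first register holding the most significant bytes; the function name is borrowed from the mnemonic and the model is illustrative only.

```python
MASK64 = (1 << 64) - 1


def faligndata(rs1, rs2, gsr_align):
    """Concatenate rs1:rs2 (16 bytes) and return the 8 bytes that start
    at byte offset gsr_align, counting from the most significant byte."""
    combined = ((rs1 & MASK64) << 64) | (rs2 & MASK64)
    shift = (8 - (gsr_align & 0x7)) * 8
    return (combined >> shift) & MASK64
```

With an offset of 3, the result consists of the last five bytes of the first register followed by the first three bytes of the second.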


VIS Execution Environment


The floating-point and VIS execution unit includes the floating-point register file for
floating-point and fixed-point data formats and the execution pipelines for floating-point and
VIS instructions.


This execution unit is a single unit that may be referred to by any one of the following names,
depending on the context:

- Floating-point Unit (FPU)
- Floating-point and Graphics Unit (FGU)
- VIS Execution Unit (VIS)
- FPU/VIS





Note — The instructions associated with the FPU/VIS execution unit are divided between 
floating-point and VIS execution environments, but otherwise use the same hardware 
pipelines. 


VIS Pixel Data Instructions 


Array Instruction 


These instructions convert three-dimensional (3D) fixed-point addresses to a blocked-byte 
address. 


Byte Mask and Shuffle Instructions 


The Byte Mask (BMASK) instruction adds two integer registers and stores the result in the
integer destination register. The least significant 32 bits of the result are also stored in
the GSR.MASK field.


Byte Shuffle concatenates the two 64-bit floating-point registers to form a 16-byte value. 
Bytes in the concatenated value are numbered from most significant to least significant, with 
the most significant byte being byte 0. 
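
The interaction between Byte Mask and Byte Shuffle can be sketched as follows. This Python model assumes, following the VIS definition, that each 4-bit nibble of the 32-bit mask selects one byte of the 16-byte concatenated value, with the most significant nibble choosing the most significant result byte; treat that mapping and the function names as assumptions of the sketch.

```python
MASK64 = (1 << 64) - 1


def bmask(rs1, rs2):
    """Return (rd, gsr_mask): the 64-bit sum and its low 32 bits."""
    rd = (rs1 + rs2) & MASK64
    return rd, rd & 0xFFFFFFFF


def bshuffle(rs1, rs2, gsr_mask):
    """Pick eight bytes from the 16-byte value rs1:rs2 using the mask nibbles."""
    source = (rs1 & MASK64).to_bytes(8, "big") + (rs2 & MASK64).to_bytes(8, "big")
    result = bytearray(8)
    for i in range(8):
        # Nibble i (from the most significant end) selects source byte 0..15.
        selector = (gsr_mask >> (28 - 4 * i)) & 0xF
        result[i] = source[selector]
    return int.from_bytes(result, "big")
```

A mask of 0x01234567 reproduces the first source register unchanged, while 0xFEDCBA98 returns the second register with its bytes reversed.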


Edge Handling Instructions


These instructions handle the boundary conditions for parallel pixel scan line loops, where 
the address of the next pixel to render and the address of the last pixel in the scan line are 
provided. 


Pixel Packing Instructions 


These instructions convert multiple values in a source register to a lower-precision fixed or 
pixel format and store the resulting values in the destination register. Input values are clipped 
to the dynamic range of the output format. Packing applies a scale factor to allow flexible 
positioning of the binary point. 


Expand and Merge Instructions 


Expand takes four 8-bit unsigned integers, converts each integer to a 16-bit fixed-point value, 
and stores the four resulting 16-bit values in a 64-bit floating-point register. 


Merge interleaves four corresponding 8-bit unsigned values to produce a 64-bit value in the 
64-bit floating-point destination register. This instruction converts from packed to planar 
representation when it is applied twice in succession. 


Pixel Distance Instruction 


Eight unsigned 8-bit values are contained in the 64-bit floating-point source registers. The 
corresponding 8-bit values in the source registers are subtracted. The sum of the absolute 
value of each difference is added to the integer in the 64-bit floating-point destination 
register. The result is stored in the destination register. Typically, this instruction is used for 
motion estimation in video compression algorithms. 


VIS Fixed-Point 16-bit and 32-bit Data Instructions 


Partitioned Add and Subtract Instructions 


The standard versions of these instructions perform four 16-bit or two 32-bit partitioned adds 
or subtracts between the corresponding fixed-point values contained in the source operands. 


The single-precision versions of these instructions perform two 16-bit or one 32-bit 
partitioned add(s) or subtract(s); only the low 32 bits of the destination register are affected. 


Partitioned Multiply Instructions


These instructions multiply signed and unsigned registers of different sizes and place the 
results in different types of destination registers. 


Pixel Compare Instruction 


Either four 16-bit or two 32-bit fixed-point values in the 64-bit floating-point source registers 
are compared. The 4-bit or 2-bit results are stored in the least significant bits in the integer 
destination register. Signed comparisons are used. 


VIS Logic Instructions 


Fill with Ones and Zeroes Instruction 


These instructions perform a zero fill or a one fill. 


Source Copy 


These instructions perform a source copy. 


AND, OR, NAND, NOR, and XNOR Instructions 


These instructions perform the logical operations. 


Data Coherency Instructions


The processor implements the Total Store Order (TSO) memory model, which provides the majority
of data coherency support in hardware. Two instructions are used with this model to synchronize
the data for memory operations to ensure that the latest data is accessed for load instructions
and DMA activity.


Chapter 8, “Memory Models,” discusses TSO in detail.


FLUSH Instruction Cache Instruction





The FLUSH instruction is used to flush the caches out to main memory. The MEMBAR
instruction is used to flush the various data buffers in the processor out to the data-coherent
domain.


Self-modifying code (storable in the unified L2-cache) requires the use of the FLUSH
instruction.


Note — The FLUSHW instruction flushes the register windows and is not related to the
FLUSH instruction for the I-cache.


MEMBAR (Memory Synchronization) Instruction 


Two forms of memory barrier (MEMBAR) instructions allow programs to manage the order
and completion of memory references. Ordering MEMBAR instructions induce a partial
ordering between sets of loads and stores and future loads and stores. Sequencing MEMBAR
instructions exert explicit control over completion of loads and stores (or other instructions).
Both barrier forms are encoded in a single instruction, with subfunctions bit-encoded in an
immediate field.
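
The bit encoding of the MEMBAR immediate can be illustrated with a short sketch. The bit assignments below are the SPARC-V9 values for the ordering (mmask) and sequencing (cmask) subfields and are stated here as an assumption; the Python helper `membar_imm` is hypothetical.

```python
# Ordering constraints (mmask, bits 3..0 of the immediate).
MMASK = {
    "#LoadLoad": 0x01,
    "#StoreLoad": 0x02,
    "#LoadStore": 0x04,
    "#StoreStore": 0x08,
}

# Sequencing constraints (cmask, bits 6..4 of the immediate).
CMASK = {
    "#Lookaside": 0x10,
    "#MemIssue": 0x20,
    "#Sync": 0x40,
}


def membar_imm(*names):
    """OR together the bits for the named MEMBAR subfunctions."""
    table = {**MMASK, **CMASK}
    value = 0
    for name in names:
        value |= table[name]
    return value
```

For example, `membar_imm("#StoreLoad", "#StoreStore")` yields 0x0A, combining two ordering constraints in one instruction.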


Store Barrier Instruction 


Note — STBAR is also supported, but this instruction is deprecated and should not be used
in newly developed software.


Register Window Management Instructions


Register window instructions manage the register windows. SAVE and RESTORE are
non-privileged and cause a register window to be pushed or popped. FLUSHW is non-privileged
and causes all of the windows except the current one to be flushed to memory. SAVED and
RESTORED are used by privileged software to end a window spill or fill trap handler.


The instructions that manage register windows include SAVE, RESTORE, SAVED,
RESTORED, and FLUSHW.


SAVE Instruction





The SAVE instruction allocates a new register window and saves the caller’s register window 
by incrementing the CWP register. 


RESTORE Instruction 











The RESTORE instruction restores the previous register window by decrementing the CWP 
register. 





SAVED Instruction





The SAVED instruction is used by a spill trap handler to indicate that a window spill has 
completed successfully. It increments CANSAVE. 





RESTORED Instruction











The RESTORED instruction is used by a fill trap handler to indicate that a window has been 
filled successfully. It increments CANRESTORE. 
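
The bookkeeping performed by these four instructions can be sketched with a simplified model. This Python code assumes eight register windows and the SPARC-V9 update rules for CWP, CANSAVE, CANRESTORE, and OTHERWIN; it ignores CLEANWIN and the actual spilling and filling of window contents, and the class name is hypothetical.

```python
NWINDOWS = 8  # assumed window count for this sketch


class WindowState:
    """Simplified register-window state (CLEANWIN is not modeled)."""

    def __init__(self):
        self.cwp = 0
        self.cansave = NWINDOWS - 2
        self.canrestore = 0
        self.otherwin = 0

    def save(self):
        if self.cansave == 0:
            return "spill trap"          # handler must free a window first
        self.cwp = (self.cwp + 1) % NWINDOWS
        self.cansave -= 1
        self.canrestore += 1
        return "ok"

    def restore(self):
        if self.canrestore == 0:
            return "fill trap"           # handler must reload a window first
        self.cwp = (self.cwp - 1) % NWINDOWS
        self.cansave += 1
        self.canrestore -= 1
        return "ok"

    def saved(self):
        """End of a spill handler: one more window can be saved into."""
        self.cansave += 1
        if self.otherwin == 0:
            self.canrestore -= 1
        else:
            self.otherwin -= 1

    def restored(self):
        """End of a fill handler: one more window can be restored."""
        self.canrestore += 1
        if self.otherwin == 0:
            self.cansave -= 1
        else:
            self.otherwin -= 1
```

In this model the sum CANSAVE + CANRESTORE + OTHERWIN stays at NWINDOWS - 2 across all four operations, which is a convenient consistency check.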

















Flush Register Windows Instruction 


The FLUSHW instruction cleans register windows of the data from other processes to ensure a
secure execution environment.


Program Control Transfer Instructions


Control transfer instructions (CTIs) include PC-relative branches and calls, register-indirect 
jumps, and conditional traps. Most of the CTIs are delayed; that is, the instruction 
immediately following a CTI in logical sequence is dispatched before the control transfer to 
the target address is completed. Note that the next instruction in logical sequence may not be 
the instruction following the CTI in memory. 


The instruction following a delayed CTI is called a delay instruction. A bit in a delayed CTI 
(the annul bit) can cause the delay instruction to be annulled (that is, to have no effect) if the 
branch is not taken (or in the “branch always” case if the branch is taken). 


Compatibility Note — SPARC-V8 specified that the delay instruction was always fetched,
even if annulled, and an annulled instruction could not cause any traps. SPARC-V9 does not
require the delay instruction to be fetched if it is annulled.


Branch and CALL instructions use PC-relative displacements. The jump and link (JMPL) and
return (RETURN) instructions use a register-indirect target address. They compute their target
addresses either as the sum of two r registers or as the sum of an r register and a 13-bit
signed immediate value. The “branch on condition codes without prediction” instruction
provides a displacement of ±8 MB; the “branch on condition codes with prediction”
instruction provides a displacement of ±1 MB; the “branch on register contents” instruction
provides a displacement of ±128 KB; and the CALL instruction's 30-bit word displacement
allows a control transfer to any address within ±2 GB (±2^31 bytes).
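
The displacement ranges quoted above follow from the width of each instruction's signed word displacement field. The following Python sketch works through the arithmetic; the field widths of 22, 19, and 16 bits are the SPARC-V9 values and are an assumption here (only the 30-bit CALL field is stated in the text), and the function name is hypothetical.

```python
def reach_bytes(disp_bits):
    """Byte reach in one direction of a signed word displacement field.

    A signed field of disp_bits bits spans 2**(disp_bits - 1) words in
    each direction, and each word is 4 bytes.
    """
    return (1 << (disp_bits - 1)) * 4


DISPLACEMENT_BITS = {
    "branch on condition codes without prediction": 22,
    "branch on condition codes with prediction": 19,
    "branch on register contents": 16,
    "CALL": 30,
}

KB, MB, GB = 2**10, 2**20, 2**30
```

Evaluating `reach_bytes` for 22, 19, 16, and 30 bits gives 8 MB, 1 MB, 128 KB, and 2 GB respectively, matching the ranges in the paragraph.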

















Note — The return from privileged trap instructions (DONE and RETRY) get their target 
address from the appropriate TPC or TNPC register. 





Control Transfer Instructions (CTIs) 


The following are the basic CTI types: 
- Conditional branch (Bicc, BPcc, BPr, FBfcc, FBPfcc)
- Unconditional branch
- Call and link (CALL)
- Jump and link (JMPL, RETURN)
- Return from trap (DONE, RETRY)
- Trap (Tcc, ILLTRAP)
- No Operation (NOP, SIR when in non-privileged mode)


A CTI functions by changing the value of the next program counter (nPC) or by changing
the value of both the program counter (PC) and the nPC. When only the next program
counter, nPC, is changed, the effect of the transfer of control is delayed by one instruction.
Most control transfers are of the delayed variety. The instruction following a delayed CTI is
said to be in the delay slot of the CTI. Some CTIs (branches) can optionally annul, that is,
not execute, the instruction in the delay slot, depending upon whether the transfer is taken or
not taken. Annulled instructions have no effect upon the program-visible state, nor can they
cause a trap.


Programming Note — The annul bit increases the likelihood that a compiler can find a
useful instruction to fill the delay slot after a branch, thereby reducing the number of 
instructions executed by a program. For example, the annul bit can be used to move an 
instruction from within a loop to fill the delay slot of the branch that closes the loop. 


Likewise, the annul bit can be used to move an instruction from either the “else” or “then” 
branch of an “if-then-else” program block to the delay slot of the branch that selects between 
them. Since a full set of conditions is provided, a compiler can arrange the code (possibly 
reversing the sense of the condition) so that an instruction from either the “else” branch or 
the “then” branch can be moved to the delay slot. 


Use of annulled branches provided some benefit in older, single-issue SPARC
implementations. The UltraSPARC IIIi processor is a superscalar SPARC implementation in
which the only benefit of annulled branches might be a slight reduction in code size.
Therefore, the use of annulled branch instructions is no longer encouraged.


TABLE 7-2 defines the value of the PC and the value of the nPC after execution of each 
instruction. Conditional branches have two forms: branches that test a condition (including 
branch-on-register), represented in the table by Bcc (same as Bicc), and branches that are 
unconditional, that is, always or never taken, represented in the table by B. The effect of an 
annulled branch is shown in the table through explicit transfers of control, rather than 
fetching and annulling the instruction. 
Instruction Group   Address Form        Delayed   Taken   Annul Bit   New PC      New nPC
Non-CTIs            -                   -         -       -           nPC         nPC + 4
Bcc                 PC-relative         Yes       Yes     0           nPC         EA
Bcc                 PC-relative         Yes       No      0           nPC         nPC + 4
Bcc                 PC-relative         Yes       Yes     1           nPC         EA
Bcc                 PC-relative         Yes       No      1           nPC + 4     nPC + 8
B                   PC-relative         Yes       Yes     0           nPC         EA
B                   PC-relative         Yes       No      0           nPC         nPC + 4
B                   PC-relative         Yes       Yes     1           EA          EA + 4
B                   PC-relative         Yes       No      1           nPC + 4     nPC + 8
CALL                PC-relative         Yes       -       -           nPC         EA
JMPL, RETURN        Register-indirect   Yes       -       -           nPC         EA
DONE                Trap state          No        -       -           TNPC[TL]    TNPC[TL] + 4
RETRY               Trap state          No        -       -           TPC[TL]     TNPC[TL]
Tcc                 Trap vector         No        Yes     -           EA          EA + 4
Tcc                 Trap vector         No        No      -           nPC         nPC + 4


The effective address (EA) in TABLE 7-2 specifies the target of the control transfer instruction. 
The EA is computed in different ways, depending on the particular instruction: 


- PC-relative effective address: A PC-relative EA is computed by sign extending the 
instruction's immediate field to 64 bits, left-shifting the word displacement by two bits to 
create a byte displacement, and adding the result to the contents of the PC. 


- Register-indirect effective address: A register-indirect EA computes its target address 
as either r[rs1] + r[rs2] if i = 0, or r[rs1] + sign_ext(simm13) if i = 1. 


- Trap vector effective address: A trap vector EA first computes the software trap 
number as the least significant 7 bits of r[rs1] + r[rs2] if 
i = 0, or as the least significant 7 bits of r[rs1] + sw_trap# if i = 1. The trap level, 
TL, is incremented. The hardware trap type is computed as 256 + sw_trap# and stored in 
TT[TL]. The EA is generated by concatenation of the contents of the TBA register, the 
"TL > 0" bit, and the contents of TT[TL]. 


- Trap state effective address: A trap state EA is not computed but is taken directly from 
either TPC[TL] or TNPC[TL]. 
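

The PC-relative and register-indirect address forms can be sketched in a few lines of Python. This is an illustrative model only: the function names and the explicit 64-bit mask are assumptions made for the example, not part of the processor definition.

```python
MASK64 = (1 << 64) - 1  # addresses are modeled as 64-bit unsigned values


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def pc_relative_ea(pc, disp, bits):
    """Sign-extend the word displacement, shift it left by two, add it to the PC."""
    return (pc + (sign_extend(disp, bits) << 2)) & MASK64


def register_indirect_ea(r_rs1, i, r_rs2=0, simm13=0):
    """r[rs1] + r[rs2] if i = 0, otherwise r[rs1] + sign_ext(simm13)."""
    operand = r_rs2 if i == 0 else sign_extend(simm13, 13)
    return (r_rs1 + operand) & MASK64
```

For example, a 22-bit displacement of all ones encodes -1 word, so `pc_relative_ea(0x1000, 0x3FFFFF, 22)` yields the address four bytes before the branch.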
Compatibility Note — SPARC-V8 specified that the delay instruction was always fetched, 
even if annulled, and that an annulled instruction could not cause any traps. SPARC-V9 does 
not require the delay instruction to be fetched if it is annulled. 


SPARC V8 left undefined the result of executing a delayed conditional branch that had a 
delayed control transfer in its delay slot. For this reason, programmers should avoid such 
constructs when backward compatibility is an issue. 


Conditional Branches 


A conditional branch transfers control if the specified condition is true. If the annul bit is 
zero, the instruction in the delay slot is always executed. If the annul bit is one, the 
instruction in the delay slot is not executed unless the conditional branch is taken. 


Note — The annul behavior of a taken conditional branch is different from that of an 
unconditional branch. 


Unconditional Branches 


An unconditional branch transfers control unconditionally if its specified condition is 
“always”; it never transfers control if its specified condition is “never.” If the annul bit is 
zero, then the instruction in the delay slot is always executed. If the annul bit is one, then the 
instruction in the delay slot is never executed. 


Note — The annul behavior of an unconditional branch is different from that of a taken 
conditional branch. 
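

The branch rows of TABLE 7-2, together with the two annul rules above, can be restated as a small Python function. This is a sketch for illustration: the function name and argument order are assumptions, and 64-bit address wraparound is ignored.

```python
def branch_step(npc, ea, conditional, taken, annul):
    """Return (new_pc, new_npc) after a delayed branch, per TABLE 7-2.

    npc is the nPC before the branch executes; ea is the branch target.
    """
    if conditional:
        # Bcc: the delay slot is annulled only if a = 1 and the branch is not taken.
        if taken:
            return npc, ea
        return (npc + 4, npc + 8) if annul else (npc, npc + 4)
    # Unconditional branch: a = 1 always annuls the delay slot.
    if taken:  # condition is "always"
        return (ea, ea + 4) if annul else (npc, ea)
    return (npc + 4, npc + 8) if annul else (npc, npc + 4)  # condition is "never"
```

When the new PC equals the old nPC, the delay-slot instruction executes next; any other new PC means the delay slot was skipped.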


CALL/JMPL and RETURN Instructions 


CALL 


The CALL instruction writes the contents of the PC, which points to the CALL instruction 
itself, into r[15] (out register 7) and then causes a delayed transfer of control to a PC- 
relative effective address. The value written into r[15] is visible to the instruction in the 
delay slot. 


When PSTATE.AM = 1, the value of the high-order 32 bits transmitted to r[15] by the 
CALL instruction is zero. 





Jump and Link 


The JMPL instruction writes the contents of the PC, which points to the JMPL instruction 
itself, into r[rd] and then causes a register-indirect delayed transfer of control to the 
address given by "r[rs1] + r[rs2]" or "r[rs1] + a signed immediate value." The 
value written into r[rd] is visible to the instruction in the delay slot. 


When PSTATE.AM = 1, the value of the high-order 32 bits transmitted to r[rd] by the 
JMPL instruction is zero. 





RETURN 





The RETURN instruction is used to return from a trap handler executing in non-privileged 
mode. RETURN combines the control-transfer characteristics of a JMPL instruction with r[0] 
specified as the destination register and the register-window semantics of a RESTORE 
instruction. 

















DONE and RETRY Instructions 














The DONE and RETRY instructions are used by privileged software to return from a trap. 
These instructions restore the machine state to values saved in the TSTATE register. 











RETRY returns to the instruction that caused the trap in order to re-execute it. DONE returns 
to the instruction pointed to by the value of nPC associated with the instruction that caused 
the trap, that is, the next logical instruction in the program. DONE presumes that the trap 
handler did whatever was requested by the program and that execution should continue. 





Trap Instruction (Tcc) 


The Tcc instruction initiates a trap if the condition specified by its cond field matches the 
current state of the condition code register specified by its cc field; otherwise, it executes as 
a NOP. If the trap is taken, it increments the TL register, computes a trap type that is stored 
in TT[TL], and transfers to a computed address in the trap table pointed to by TBA. 


A Tcc instruction can specify 1 of 128 software trap types. When a Tcc is taken, 256 plus 
the seven least significant bits of the sum of the Tcc’s source operands is written to TT[TL]. 
The only visible difference between a software trap generated by a Tcc instruction and a 
hardware trap is the trap number in the TT register. 

Programming Note — Tcc can be used to implement breakpointing, tracing, and calls to 
supervisor software. Tcc can also be used for runtime checks, such as out-of-range array 
index checks or integer overflow checks. 
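

The trap type arithmetic for a taken Tcc, and the concatenation that forms the trap vector address, can be sketched as follows. The function names are hypothetical, and the bit positions used for the concatenation (TBA in bits 63:15, the "TL > 0" bit in bit 14, TT in bits 13:5, zeros in bits 4:0) are an assumption taken from the SPARC V9 trap table layout rather than from this section.

```python
MASK64 = (1 << 64) - 1


def tcc_trap_type(r_rs1, operand):
    """TT value for a taken Tcc: 256 plus the low 7 bits of the operand sum."""
    return 256 + ((r_rs1 + operand) & 0x7F)


def trap_vector_address(tba, tl_before_trap, tt):
    """Concatenate TBA<63:15>, the "TL > 0" bit, and the 9-bit TT value.

    Assumed layout (SPARC V9): the low 5 bits are zero, so each trap table
    entry spans 32 bytes.
    """
    base = tba & ~0x7FFF & MASK64
    tl_bit = 1 if tl_before_trap > 0 else 0
    return base | (tl_bit << 14) | ((tt & 0x1FF) << 5)
```

Because only seven bits of the operand sum are kept, software trap numbers wrap within the range 0 to 127, and the resulting TT values fall between 256 and 383.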


ILLTRAP 


The ILLTRAP instruction causes an illegal instruction exception. 


NOP 


A NOP instruction occupies the entire (single) instruction group and performs no visible 
work. 

There are other instructions that also result in an operation that has no visible effect: 

- SIR instruction executed in non-privileged mode 
- SHUTDOWN instruction executed in privileged mode 


There are other instructions that appear to be a NOP as long as they do not affect the 
condition codes. 


Prefetch Instructions 


The prefetch instruction is used to request that data be fetched from memory and put into the 
cache(s) if not already there for use in the floating-point and VIS execution environment. A 
subsequent load, if properly scheduled, can expect the data to more likely be in the cache, 
reducing the number of times the pipeline must recycle and thus improving performance. 











The destination field of a PREFETCH instruction (fcn) is used to encode the prefetch type. 
The PREFETCHA instruction supports accesses to alternate space. 





























PREFETCH accesses at least 64 bytes. 





7.10 Instruction Summary Table by Category 


Instructions are summarized by category in the tables that begin with TABLE 7-3. 


UltraSPARC IIIi Processor User's Manual • June 2003 


7.10.1 Instruction Superscripts 





INSTRUCTION^P - Instruction must execute in privileged mode. 
INSTRUCTION - Instruction can execute in privileged or non-privileged mode. 


7.10.2 Instruction Mnemonics Expansion 


INSTRUCTION{A} - means INSTRUCTION, INSTRUCTION A 

INSTRUCTION(A,B,C) - means INSTRUCTION A, INSTRUCTION B, and 
INSTRUCTION C 
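

The expansion rules can be tried out with a short Python helper. It is only a sketch: the `expand` function is hypothetical, and it assumes that braces mark an optional suffix while parentheses list alternatives of which exactly one is chosen, as in the summary tables.

```python
import itertools
import re

# Matches either a parenthesized group "(A,B,C)" or a braced group "{A}".
_GROUP = re.compile(r"\(([^()]*)\)|\{([^{}]*)\}")


def expand(mnemonic):
    """Expand (A,B,C) alternatives and {A} optional suffixes into full mnemonics."""
    parts, pos = [], 0
    for match in _GROUP.finditer(mnemonic):
        parts.append([mnemonic[pos:match.start()]])
        if match.group(1) is not None:
            # (A,B,C): exactly one of the listed choices is used.
            parts.append([c.strip() for c in match.group(1).split(",")])
        else:
            # {A}: either nothing or the listed suffix.
            parts.append([""] + [c.strip() for c in match.group(2).split(",")])
        pos = match.end()
    parts.append([mnemonic[pos:]])
    return ["".join(combo) for combo in itertools.product(*parts)]
```

For instance, `expand("ADD{cc}")` gives the plain and condition-code-setting forms of ADD, and `expand("LDS(B,H,W)")` gives the three signed load sizes.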


7.10.3 Instruction Grouping Rules 


Instruction grouping rules are explained in detail in Chapter 4 “Instruction Execution." 


Execution Latency 


All instructions execute within the pipeline except the following: 

- FSQRT (floating-point square root) 
- FDIVx (floating-point divide) 


The latency of these instructions depends on the precision of the floating-point values. Some 
instructions execute early in the pipeline and have special bypass abilities. The details of the 
execution latencies are explained in Chapter 4 “Instruction Execution.” 


7.10.4 Table Organization 


The Instruction Summary Table has the following main sections: 


- Integer Execution Environment (TABLE 7-3): 
  data access, arithmetic, logic, compare, and miscellaneous instructions 
- Floating-point Execution Environment (TABLE 7-4): 
  FP/VIS data access and FP arithmetic, logic, compare, and miscellaneous instructions 
- VIS Execution Environment (TABLE 7-5): 
  VIS pixel and fixed-point arithmetic and logic instructions 
- Data Coherency Instructions (TABLE 7-6) 
- Register-window Management Instructions (TABLE 7-7) 
- Program Control Transfer Instructions (TABLE 7-8) 
- Prefetch Instructions (TABLE 7-9) 


Shaded areas indicate instructions that are completely deprecated (entire row) or always 
privileged (cell holding instruction name). Deprecated and privilege status is identified with 
a D or P superscript, respectively. 
7.10.5 Integer Execution Environment Instructions 


TABLE 7-3 Instruction Summary for the Integer Execution Environment (1 of 3) 

Instruction              Description                                          Notes

Integer Execution Environment
IU Data Access Instructions (B = byte; H = halfword; W = word)                ASI Load (hex)

LDD^D                    Load integer double word                             No
LDDA^{D,PASI}            Load integer double word from alternate space
LDDA^PASI                Atomic quad load                                     24, 2C
LDS(B,H,W)               Load signed extended byte, halfword, or word:        No
                         Memory → IU register
LDX                      Load extended (double) word                          No
LDXA^PASI                Load extended (double) word from alternate space
LDS(B,H,W)A^PASI         Load signed extended byte, halfword, or word
                         from alternate space
LDSTUB                   Load-store (atomic) unsigned byte:                   No
                         Memory → IU register & Compare logic;
                         IU register → Memory (conditional)
LDSTUBA^PASI             Load-store (atomic) unsigned byte (see LDSTUB)
                         in alternate space
LDU(B,H,W)               Load unsigned byte, halfword, word:
                         Memory → IU register
LDU(B,H,W)A^PASI         Load unsigned byte, halfword, word from
                         alternate space
ST(B,H,W,D^D,X)          Store byte, halfword, word, double, or
                         extended word:
                         IU register → Memory
ST(B,H,W,D^D,X)A^PASI    Store byte, halfword, word, double, or
                         extended word in alternate space
MOVcc                    Conditional move based on icc/fcc:                   1
                         IU register → IU register
MOVr                     Conditional move based on IU register value:         2
                         IU register → IU register
CASA^PASI, CASXA^PASI    Atomic Compare and Swap word/double word             3, 4, 5
                         in alternate space:
                         Memory → Compare logic
                         Memory ↔ (conditional) Working register
SWAP^D (A^{D,PASI})      Atomically swap optionally with alternate space:
                         IU register ↔ Memory
TABLE 7-3 Instruction Summary for the Integer Execution Environment (2 of 3) 

Instruction              Description                                          Notes

IU Arithmetic Instructions (S = signed; U = unsigned; X = 64-bit (otherwise 32))

ADD{cc}                  Integer add
ADDC{cc}                 Integer add with carry
SUB{cc}                  Integer subtract, optionally setting icc/xcc
SUBC{cc}                 Integer subtract with carry, optionally setting
                         icc/xcc
MULX                     Signed or unsigned 64-bit multiply
(S,U)MUL{cc}^D           Signed/unsigned integer multiply, optionally
                         setting icc/xcc
UDIVX                    Unsigned 64-bit integer divide
SDIVX                    Signed 64-bit integer divide
(S,U)DIV{cc}^D           Signed/unsigned 32-bit integer divide,
                         optionally setting icc/xcc
SETHI                    Modify highest 22 bits of low word in IU register:
                         Immediate → IU register (partial)
SLL{X}                   Shift left logical (32/64-bit)
SRL{X}                   Shift right logical (32/64-bit)
SRA{X}                   Shift right arithmetic (32/64-bit)
TADDcc(TV^D)             Tagged add and modify icc, optionally trap
                         on overflow
TSUBcc(TV^D)             Tagged subtract and modify icc, optionally trap
                         on overflow

IU Logic Instructions

AND{cc}                  Logical AND, optionally setting icc/xcc
ANDN{cc}                 Logical AND-not, optionally setting icc/xcc
OR{cc}                   Logical OR, optionally setting icc/xcc
ORN{cc}                  Logical OR-not, optionally setting icc/xcc
XOR{cc}                  Logical XOR, optionally setting icc/xcc
XNOR{cc}                 Logical XNOR, optionally setting icc/xcc

IU Miscellaneous Instructions

SIAM
ALIGNADDRESS{_LITTLE}    Calculates aligned address
POPC                     Defined to count the number of ones in a
                         register; unimplemented (causes an
                         illegal_instruction exception, which traps to
                         software for emulation)
TABLE 7-3 Instruction Summary for the Integer Execution Environment (3 of 3) 

Instruction              Description                                          Notes

RDPR^P                   Read privileged register
WRPR^P                   Write privileged register
RDASR^PASR               Read ancillary state register (ASR) - see below.
                         Privileged mode required for privileged ASRs.

RDY^D, RDCCR, RDASI, RDPC, RDFPRS, RDPCR^P, RDPIC^{P_PCR.PRIV}, RDDCR^P, RDGSR,
RDSOFTINT^P, RDTICK^{P_NPT}, RDSTICK^{P_NPT}, RDTICK_CMPR^P, RDSTICK_CMPR^P
                         Read state and ancillary state registers:
                         - If PCR.PRIV field is one, then PIC register
                           access requires privileged mode.
                         - If {TICK|STICK}.NPT field is zero, then
                           TICK/STICK register reads require privileged
                           mode.

WRASR^PASR               Write ancillary state register (ASR); privileged
                         mode required for privileged ASRs.

WRY^D, WRCCR, WRASI, WRFPRS, WRPCR^P, WRPIC^{P_PCR.PRIV}, WRDCR^P, WRGSR,
WRSOFTINT^P, WRSOFTINT_CLR^P, WRSOFTINT_SET^P, WRSTICK^{P_NPT}, WRTICK_CMPR^P,
WRSTICK_CMPR^P
                         Write state and ancillary state registers:
                         - If PCR.PRIV field is one, then PIC register
                           access requires privileged mode.
                         - If STICK.NPT field is zero, then STICK
                           register writes require privileged mode.

1. A simple register-to-register move is accomplished by using the OR instruction with r[0]. 
2. Load (LD) and store (ST) instructions are provided with many size formats (byte, word, double word, etc.) and most can be specified with an 
alternate space identifier (ASI). 
3. The "r" refers to value in r registers. 
4. The cc refers to settings of the integer condition codes. 
5. The conditional move instructions (integer and floating-point) are influenced by the condition codes of either execution unit to facilitate moves 
in one type of execution unit based on the condition codes of the other or of those within the execution unit. 
7.10.6 Floating-Point Execution Environment Instructions 


TABLE 7-4 Instruction Summary for the Floating-point Execution Environment 

Instruction              Description                                          Notes

FP/VIS Data Access Instructions                                               ASI Load (hex)
s = 32-bit; d = 64-bit; q = 128-bit (q is trapped)

LD{D}F                   Load word (or double word):                          No
                         Memory → FPU register
LD{D}FA^PASI             Load word (or double word) from alternate space:
                         Memory → FPU register
LDDFA                    Block load 64 bytes:
                         Memory → FPU registers
LDDFA                    Load short:
                         Memory → FPU register
LDQF                     Load quadword:                                       No
                         Memory → FPU register
LDQFA^PASI               Load quadword from alternate space:                  No
                         Memory → FPU register
ST(F,DF,QF)              Store word, double, or quad word to memory:          No
                         FPU register → Memory
ST(F,DF,QF)A^PASI        Store word, double, or quad word to memory
                         using alternate memory space
STDFA                    Block store 64 bytes: uses ASIs 70, 71, 78, 79,
                         F0, F1, F8, F9, E0, E1
STDFA                    Short FP store: uses ASIs D(0:3) hex, D(8:B) hex
STDFA                    Partial store FPU: uses ASIs C(0:5) hex, C(8:D) hex
FMOV(s,d,q)              FPU → FPU register                                   No
FMOV(s,d,q)cc            Conditional move, IU or FPU condition codes:         No
                         FPU → FPU register
FMOV(s,d,q)r             Conditional move, IU or FPU register value:          No
                         FPU → FPU register

FP Arithmetic Instructions
s = 32-bit; d = 64-bit; q = 128-bit (q is trapped)

FABS(s,d,q)              FP absolute value
FNEG(s,d,q)              Change FP sign
FADD(s,d,q)              FP add
FSUB(s,d,q)              FP subtract
TABLE 7-4 Instruction Summary for the Floating-point Execution Environment (Continued) 

Instruction              Description                                          Notes

FMUL(s,d,q)              FP multiply
FdMULq                   FP multiply doubles to quadword
FsMULd                   FP multiply singles to doubleword
FDIV(s,d,q)              FP division
FSQRT(s,d,q)             FP square root

FP Conversion Instructions
s = 32-bit; d = 64-bit; q = 128-bit (q is trapped); i = integer word; x = double (or extended) word

F(s,d,q)TOi              Floating-point to integer word
F(s,d,q)TOx              Floating-point to integer double word
F(s,d,q)TO(s,d,q)        Floating-point to floating-point
FiTO(s,d,q)              Integer word to floating-point
FxTO(s,d,q)              Integer double (or extended) word to
                         floating-point

FP Compare Instructions

FCMP(s,d,q)              FP compare of like precision, sets fcc
                         condition codes
FCMPE(s,d,q)             Same as FCMP, but an exception is
                         generated if unordered

FP Miscellaneous Instructions

LDFSR^D                  Load FSR into FP reg file:
                         FSR ← FPU register (lower 32-bit)
LDXFSR                   Load FSR into FP reg file:
                         FSR ← FPU register (64-bit)
STFSR^D                  Store FSR register:
                         FPU (lower 32-bit) ← FSR register
STXFSR                   Store FSR register:
                         FPU ← FSR register
FALIGNDATA               Concatenates two 64-bit registers into one
                         based on GSR.ALIGN
7.10.7 VIS Execution Environment Instructions 


TABLE 7-5 Instruction Summary for the VIS Execution Environment 

Instruction                 Description                                       Notes

VIS Data Access Instructions

Refer to Section 7.10.6, "Floating-Point Execution Environment Instructions" of the Instruction Summary Table. 

VIS Pixel Data Instructions
L = little-endian; N = fcc not modified; S = 32-bit (otherwise 64-bit)

ARRAY(8,16,32)              3D-array addressing
BMASK                       Writes the GSR.MASK field
BSHUFFLE                    Permute bytes as specified by GSR.MASK field
EDGE(8,16,32){L,N,LN}       Edge handling instructions
FEXPAND                     Pixel data expansion
FPMERGE                     Pixel merge
FPACK(16,32,FIX)            Pixel packing
PDIST                       Pixel component distance

VIS Fixed-point 16/32-bit Data Instructions

FPADD(16,32){S}             Fixed-point add, 16- or 32-bit operands,
                            32/64-bit register
FPSUB(16,32){S}             Fixed-point subtract, 16- or 32-bit
                            operands, 32/64-bit register
FMUL8x16                    8x16 partitioned multiply
FMUL8x16(AU,AL)             8x16 Upper/Lower α partitioned multiply
FMUL8(SU,SL)x16             8x16 Upper/Lower partitioned multiply
FMULD8(SU,SL)x16            8x16 Upper/Lower partitioned multiply
FCMP(GT,LE,NE,EQ)(16,32)    Fixed-point compare (also known as
                            "pixel compare")

VIS Logic Instructions
S = 32-bit (otherwise 64-bit)

FSRC(1,2){S}                Copy source
FONE{S}                     Fill with ones (32/64-bit)
FZERO{S}                    Fill with zeroes (32/64-bit)
FAND{S}                     Logical AND (32/64-bit)
FANDNOT(1,2){S}             Logical AND with a source inverted (32/64-bit)
FOR{S}                      Logical OR (32/64-bit)
FNAND{S}                    Logical NAND (32/64-bit)
FNOR{S}                     Logical NOR (32/64-bit)
FORNOT(1,2){S}              Logical OR with a source inverted (32/64-bit)
FNOT(1,2){S}                Logical inversion of source bits (32/64-bit)
FXNOR{S}                    Logical XNOR (32/64-bit)
FXOR{S}                     Logical XOR (32/64-bit)
7.10.8 Data Coherency Instructions 


TABLE 7-6 Instruction Summary for Data Coherency 

Instruction              Description                                          Notes

Data Coherency Instructions

FLUSH                    Flush I-cache
MEMBAR                   Memory barrier
STBAR^D                  Store barrier

7.10.9 Register-window Management Instructions 


TABLE 7-7 Instruction Summary for Register-window Management 

Instruction              Description                                          Notes

Register-Window Management Instructions

SAVE                     Save caller's window
SAVED^P                  Window has been saved
RESTORE                  Restore caller's window
RESTORED^P               Window has been restored
FLUSHW                   Flush register windows

7.10.10 Program Control Transfer Instructions 


TABLE 7-8 Instruction Summary for Program Control Transfer 

Instruction              Description                                          Notes

Program Control Transfer Instructions
icc/xcc = integer condition codes (32/64-bit); fcc = FP condition codes

Bicc^D                   Conditional branch on icc/xcc
BPcc                     Conditional branch on icc/xcc with
                         branch prediction
BPr                      Conditional branch on IU reg value with
                         branch prediction
CALL                     Call and link
DONE^P                   Return from Trap
FBfcc^D                  Conditional branch on fcc
FBPfcc                   Conditional branch on fcc with branch
                         prediction
ILLTRAP                  Causes illegal_instruction trap
JMPL                     Jump and link
NOP                      No operation
RETRY^P                  Return from trap entry
RETURN                   Return (jump and link)
SHUTDOWN^P               Intended for Low Power, but is a NOP in
                         the processor
SIR^{P,NOP}              Software initiated reset: a NOP when
                         executed in non-privileged mode
Tcc                      Trap on icc/xcc

7.10.11 Data Prefetch Instructions 


TABLE 7-9 Instruction Summary Table 

Instruction              Description                                          Notes

Prefetch Instructions

PREFETCH                 Instructs processor to fetch data
PREFETCHA^PASI           Instructs processor to fetch data from
                         alternate memory space




















7.11 Instruction Formats and Fields 


Instructions are encoded in four major 32-bit formats and several minor formats, as shown in 
FIGURE 7-1, FIGURE 7-2, and FIGURE 7-3. 
Format 1 (op = 1): CALL 

    op (bits 31:30) | disp30 (bits 29:0) 

Format 2 (op = 0): SETHI and Branches (Bicc, BPcc, BPr, FBfcc, FBPfcc) 

    [Bit-field diagram; field boundaries at bits 31, 30, 29, 28, 25, 24, 22, 21, 20, 19, 18, 14, 13, and 0] 


FIGURE 7-1 Summary of Instruction Formats: Formats 1 and 2 
Format 3 (op = 2 or 3): Arithmetic, Logical, MOVr, MEMBAR, Prefetch, Load, and Store 

    [Bit-field diagrams for the Format 3 variants. Fields shown include op, rd, fcn, op3, rs1, i, x, 
    imm_asi, impl-dep, opf, shcnt32, shcnt64, and rs2; field boundaries at bits 31, 30, 29, 25, 24, 
    19, 18, 14, 13, 12, 11, 10, 9, 7, 6, 5, 4, 3, and 0] 


FIGURE 7-2 Summary of Instruction Formats: Format 3 


Chapter 7 


Instruction Types 


173 


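Format 4 (op = 2): MOVcc, FMOVr, FMOVcc, and Tcc 

    [Bit-field diagrams for the Format 4 variants. Fields shown include op, rd, op3, rs1, and rs2; 
    field boundaries at bits 31, 30, 29, 25, 24, 19, 18, 17, 14, 13, 12, 11, 10, 9, 7, 6, 5, 4, and 0] 


FIGURE 7-3 Summary of Instruction Formats: Format 4 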


The instruction fields are interpreted as described in TABLE 7-10. 


TABLE 7-10 Instruction Field Interpretation 

Field            Description

a                The a bit annuls the execution of the following instruction if the branch is
                 conditional and not taken, or if it is unconditional and taken.

cc2, cc1, cc0    cc2, cc1, and cc0 specify the condition codes (icc, xcc, fcc0, fcc1, fcc2, fcc3)
                 to be used in the following instructions:
                 - Branch on Floating-Point Condition Codes with Prediction (FBPfcc)
                 - Branch on Integer Condition Codes with Prediction (BPcc)
                 - Floating-Point Compare Instructions (FCMP and FCMPE)
                 - Move Integer Register If Condition Is Satisfied (MOVcc)
                 - Move Floating-Point Register If Condition Is Satisfied (FMOVcc)
                 - Trap on Integer Condition Codes (Tcc)
                 In instructions such as Tcc that do not contain the cc2 bit, the missing cc2 bit
                 takes on a default value.

cmask            This 3-bit field specifies sequencing constraints on the order of memory
                 references and the processing of instructions before and after a MEMBAR
                 instruction.

cond             This 4-bit field selects the condition tested by a branch instruction.

d16hi, d16lo     These 2-bit and 14-bit fields together comprise a word-aligned, sign-extended,
                 PC-relative displacement for a branch-on-register-contents with prediction (BPr)
                 instruction.

disp19           This 19-bit field is a word-aligned, sign-extended, PC-relative displacement for
                 an integer branch-with-prediction (BPcc) instruction or a floating-point
                 branch-with-prediction (FBPfcc) instruction.

disp22, disp30   These 22-bit and 30-bit fields are word-aligned, sign-extended, PC-relative
                 displacements for a branch or call, respectively.

fcn              This 5-bit field provides additional opcode bits to encode the DONE, RETRY, and
                 PREFETCH(A) instructions.

i                The i bit selects the second operand for integer arithmetic and load/store
                 instructions. If i = 0, then the operand is r[rs2]. If i = 1, then the operand
                 is simm10, simm11, or simm13, depending on the instruction, sign-extended to
                 64 bits.

imm22            This 22-bit field is a constant that SETHI places in bits 31:10 of a destination
                 register.

imm_asi          This 8-bit field is the ASI in instructions that access alternate space.

mmask            This 4-bit field imposes order constraints on memory references appearing before
                 and after a MEMBAR instruction.

op, op2          These 2-bit and 3-bit fields encode the three major formats and the Format 2
                 instructions.

op3              This 6-bit field (together with one bit from op) encodes the Format 3
                 instructions.

opf              This 9-bit field encodes the operation for a floating-point operate (FPop)
                 instruction.

opf_cc           Specifies the condition codes to be used in FMOVcc instructions. See fields cc0,
                 cc1, and cc2 for details.

opf_low          This 6-bit field encodes the specific operation for a Move Floating-Point
                 Register if condition is satisfied (FMOVcc) or Move Floating-Point Register if
                 contents of integer register match condition (FMOVr) instruction.

p                This 1-bit field encodes static prediction for BPcc and FBPfcc instructions;
                 branch prediction bit (p) encodings are shown below.

                 p   Branch Prediction
                 0   Predict that branch will not be taken
                 1   Predict that branch will be taken

rcond            This 3-bit field selects the register-contents condition to test for a move
                 based on register contents (MOVr or FMOVr) instruction or a Branch on Register
                 Contents with Prediction (BPr) instruction.

rd               This 5-bit field is the address of the destination (or source) r or f
                 register(s) for a load, arithmetic, or store instruction.

rs1              This 5-bit field is the address of the first r or f register(s) source operand.

rs2              This 5-bit field is the address of the second r or f register(s) source operand
                 with i = 0.

shcnt32          This 5-bit field provides the shift count for 32-bit shift instructions.

shcnt64          This 6-bit field provides the shift count for 64-bit shift instructions.

simm10           This 10-bit field is an immediate value that is sign-extended to 64 bits and
                 used as the second ALU operand for a MOVr instruction when i = 1.

simm11           This 11-bit field is an immediate value that is sign-extended to 64 bits and
                 used as the second ALU operand for a MOVcc instruction when i = 1.

simm13           This 13-bit field is an immediate value that is sign-extended to 64 bits and
                 used as the second ALU operand for an integer arithmetic instruction or for a
                 load/store instruction when i = 1.

sw_trap#         This 7-bit field is an immediate value that is used as the second ALU operand
                 for a Trap on Condition Code instruction.

x                The x bit selects whether a 32-bit or 64-bit shift will be performed.
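

As an illustration of how these fields are pulled out of an instruction word, the following Python sketch decodes the common Format 3 layout. The helper names are hypothetical, and the bit positions (op in 31:30, rd in 29:25, op3 in 24:19, rs1 in 18:14, i in bit 13, simm13 in 12:0, rs2 in 4:0) are assumed from the standard SPARC V9 Format 3 encoding.

```python
def bits(word, hi, lo):
    """Extract the inclusive bit range hi:lo from a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_format3(word):
    """Split a Format 3 word into its fields; the i bit selects rs2 or simm13."""
    i = bits(word, 13, 13)
    fields = {
        "op": bits(word, 31, 30),
        "rd": bits(word, 29, 25),
        "op3": bits(word, 24, 19),
        "rs1": bits(word, 18, 14),
        "i": i,
    }
    if i == 0:
        fields["rs2"] = bits(word, 4, 0)
    else:
        raw = bits(word, 12, 0)
        # Sign-extend the 13-bit immediate.
        fields["simm13"] = raw - (1 << 13) if raw & (1 << 12) else raw
    return fields
```

Decoding the word 0x84006005 this way gives op = 2, rd = 2, op3 = 0, rs1 = 1, i = 1, and simm13 = 5.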








7.12 Reserved Opcodes and Instruction Fields 


An attempt to execute an opcode to which no instruction is assigned causes a trap, 
specifically: 


- Attempting to execute a reserved FPop (floating-point opcode) causes an 
fp_exception_other exception (with FSR.ftt = unimplemented_FPop). 


- Attempting to execute any other reserved opcode causes an illegal_instruction exception. 


- Attempting to execute an FPop with a nonzero value in a reserved instruction field causes 
an fp_exception_other exception (with FSR.ftt = unimplemented_FPop). [1] 


- Attempting to execute a Tcc instruction with a nonzero value in a reserved instruction 
field causes an illegal_instruction exception. 


- Attempting to execute any other instruction with a nonzero value in a reserved instruction 
field causes an illegal_instruction exception. [1] 


7.12.1 Summary of Unimplemented Instructions 


Certain SPARC-V9 instructions are not implemented in hardware in the processor. Executing 
any of these instructions results in the behavior described in TABLE 7-11. 


TABLE 7-11 Processor Actions on Unimplemented Instructions 

Instructions                Trap Taken            Processor-specific Behavior     Operating System Response
Quad FPops (including ...)  fp_exception_other    FSR.ftt = unimplemented_FPop    Emulates Instruction
POPC                        illegal_instruction   None                            Emulates Instruction
RDPR FQ                     illegal_instruction   None                            Skips Instruction and Returns
[missing]                   illegal_instruction   None                            Emulates Instruction
[missing]                   illegal_instruction   None                            Emulates Instruction

















1. Although it is recommended that this exception is generated, a JPS1 implementation may ignore the contents of reserved 
instruction fields (in instructions other than Tcc). 
If a trap does not occur and the instruction is not a control transfer, the next program 
counter (nPC) is copied into the PC, and the nPC is incremented by four (ignoring overflow, 
if any). If the instruction is a control transfer instruction, the nPC is copied into the PC and 
the target address is written to nPC. Thus, the two program counters provide for a delayed- 
branch execution model. 


For each instruction access and each normal data access, the IU appends an 8-bit address 
space identifier (ASI) to the 64-bit memory address. Load/store alternate instructions (see 
Section 7.2.4, “Address Space Identifiers (ASIs)”) can provide an arbitrary ASI with their 
data addresses or can use the ASI value currently contained in the ASI register. 





7.13.1 Big/Little-Endian Addressing 


The processor uses big-endian byte order for all instruction accesses and, by default, for data 
accesses. 


It is possible to access data in little-endian format by using selected ASIs. 


It is also possible to change the default byte order for implicit data accesses. 


Big-Endian Addressing Convention 


Within a multiple-byte integer, the byte with the smallest address is the most significant; a 
byte’s significance decreases as its address increases. The big-endian addressing conventions 
are illustrated in FIGURE 7-4 and described below the figure. 
[Diagram of the byte, halfword, word, doubleword/extended word, and quadword layouts. In each 
layout the byte at the lowest address holds the most significant bits, and the byte at the 
highest address holds bits 7-0.] 


FIGURE 7-4 Big-Endian Addressing Convention 


big-endian byte
    A load/store byte instruction accesses the addressed byte in both big-endian and
    little-endian modes.

big-endian halfword
    For a load/store halfword instruction, 2 bytes are accessed. The most significant byte
    (bits 15:8) is accessed at the address specified in the instruction; the least significant
    byte (bits 7:0) is accessed at the address + 1.

big-endian word
    For a load/store word instruction, 4 bytes are accessed. The most significant byte
    (bits 31:24) is accessed at the address specified in the instruction; the least significant
    byte (bits 7:0) is accessed at the address + 3.

big-endian doubleword or extended word
    For a load/store extended or floating-point load/store double instruction, 8 bytes are
    accessed. The most significant byte (bits 63:56) is accessed at the address specified in
    the instruction; the least significant byte (bits 7:0) is accessed at the address + 7.

    For the deprecated integer load/store double instructions (LDD/STD), two big-endian
    words are accessed. The word at the address specified in the instruction corresponds to
    the even register specified in the instruction; the word at address + 4 corresponds to the
    following odd-numbered register.

big-endian quadword
    For a load/store quadword instruction, 16 bytes are accessed. The most significant byte
    (bits 127:120) is accessed at the address specified in the instruction; the least
    significant byte (bits 7:0) is accessed at the address + 15.


7.13.2 Little-Endian Addressing Convention 


Within a multiple-byte integer, the byte with the smallest address is the least significant; a 
byte’s significance increases as its address increases. The little-endian addressing 
conventions are illustrated in FIGURE 7-5 and defined below the figure. 


Byte
    Address             (any)
    Bits                7:0

Halfword
    Address<0> =        0       1
    Bits                7:0     15:8

Word
    Address<1:0> =      00      01      10      11
    Bits                7:0     15:8    23:16   31:24

Doubleword/Extended word
    Address<2:0> =      000     001     010     011
    Bits                7:0     15:8    23:16   31:24
    Address<2:0> =      100     101     110     111
    Bits                39:32   47:40   55:48   63:56

Quadword
    Address<3:0> =      0000    0001    0010    0011
    Bits                7:0     15:8    23:16   31:24
    Address<3:0> =      0100    0101    0110    0111
    Bits                39:32   47:40   55:48   63:56
    Address<3:0> =      1000    1001    1010    1011
    Bits                71:64   79:72   87:80   95:88
    Address<3:0> =      1100    1101    1110    1111
    Bits                103:96  111:104 119:112 127:120


FIGURE 7-5 Little-Endian Addressing Conventions


little-endian byte
    A load/store byte instruction accesses the addressed byte in both big-endian and
    little-endian modes.

little-endian halfword
    For a load/store halfword instruction, 2 bytes are accessed. The least significant byte
    (bits 7:0) is accessed at the address specified in the instruction; the most significant
    byte (bits 15:8) is accessed at the address + 1.

little-endian word
    For a load/store word instruction, 4 bytes are accessed. The least significant byte
    (bits 7:0) is accessed at the address specified in the instruction; the most significant
    byte (bits 31:24) is accessed at the address + 3.

little-endian doubleword or extended word
    For a load/store extended or floating-point load/store double instruction, 8 bytes are
    accessed. The least significant byte (bits 7:0) is accessed at the address specified in the
    instruction; the most significant byte (bits 63:56) is accessed at the address + 7.

    For the deprecated integer load/store double instructions (LDD/STD), two little-endian
    words are accessed. The word at the address specified in the instruction corresponds to
    the even register in the instruction; the word at the address specified in the instruction
    plus four corresponds to the following odd-numbered register. With respect to
    little-endian memory, an LDD (STD) instruction behaves as if it is composed of two 32-bit
    loads (stores), each of which is byte-swapped independently before being written into
    each destination register (memory word).

little-endian quadword
    For a load/store quadword instruction, 16 bytes are accessed. The least significant byte
    (bits 7:0) is accessed at the address specified in the instruction; the most significant
    byte (bits 127:120) is accessed at the address + 15.
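The two conventions can be compared with a short Python sketch. It is an illustration only; the helper names `store_bytes` and `ldd_register_pair` are hypothetical, and memory is modeled as a plain `bytes` object.

```python
def store_bytes(value, size, little_endian=False):
    """Return the bytes a store of `size` bytes would place at address, address + 1, ..."""
    return value.to_bytes(size, "little" if little_endian else "big")


def ldd_register_pair(mem8, little_endian=False):
    """Model the deprecated LDD: two 32-bit words, each interpreted independently.

    Returns (even_register, odd_register) for the 8 bytes starting at the
    address specified in the instruction.
    """
    order = "little" if little_endian else "big"
    even = int.from_bytes(mem8[0:4], order)   # word at address
    odd = int.from_bytes(mem8[4:8], order)    # word at address + 4
    return even, odd
```

For example, `store_bytes(0x11223344, 4)` yields the bytes 11 22 33 44 in ascending address order, and the little-endian form yields 44 33 22 11. A little-endian LDD of the bytes 01 through 08 gives 0x04030201 in the even register and 0x08070605 in the odd register, which is not the same as splitting a single 64-bit little-endian load at its midpoint.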
CHAPTER 8


Memory Models





The SPARC-V9 architecture is a model that specifies the behavior observable by software on
SPARC-V9 systems. Therefore, access to memory can be implemented in any manner, as
long as the behavior observed by software conforms to that of the models described in the
following:

- Chapter 8 of The SPARC Architecture Manual, Version 9
- Appendix D of The SPARC Architecture Manual, Version 9

The SPARC-V9 architecture defines three different memory models: Total Store Order
(TSO), Partial Store Order (PSO), and Relaxed Memory Order (RMO). The UltraSPARC IIIi
processor implements TSO, the strongest of the memory models defined by SPARC-V9.
Because the processor implements TSO, software written for any memory model (TSO, PSO,
and RMO) executes correctly on the UltraSPARC IIIi processor.


This chapter departs from the organization of the memory models described in The SPARC
Architecture Manual, Version 9. It describes the characteristics of the memory models for the
UltraSPARC IIIi processor in sections organized as follows:


- TSO Behavior
- Memory Location Identification
- Memory Accesses and Cacheability
- Memory Synchronization
- Atomic Operations
- Non-Faulting Load
- Prefetch Instructions
- Block Loads and Stores
- I/O and Accesses with Side-Effects
- Internal ASIs
- Store Compression
- Read After Write (RAW) Bypassing
8.1 TSO Behavior


The UltraSPARC IIIi processor implements the TSO memory model. The current memory
model is indicated in the PSTATE.MM field and is set to TSO (PSTATE.MM = 0).








In some cases, the UltraSPARC IIIi processor implements stronger ordering than the TSO
requirements. The significant cases are listed below:

- A MEMBAR #Lookaside is not needed between a store and a subsequent load to the
  same non-cacheable address.

- Accesses with the TTE.E bit set, such as those that have side-effects, are all strongly
  ordered with respect to one another.

- An L2-cache or W-cache update is delayed on a store hit until all previous stores reach
  global visibility. For example, a cacheable store following a non-cacheable store will not
  appear globally visible until the non-cacheable store has become globally visible; there is
  an implicit MEMBAR #MemIssue between them.








8.2 Memory Location Identification


A memory location is identified by an 8-bit address space identifier (ASI) and a 64-bit 
(virtual) address. The 8-bit ASI can be obtained from an ASI register or included in a 
memory access instruction. The ASI distinguishes among and provides an attribute to 
different 64-bit address spaces. For example, the ASI is used by the MMU and memory 
access hardware for control of virtual-to-physical address translations, access to 
implementation-dependent control and data registers, and access protection. Attempts by 
non-privileged software (PSTATE.PRIV = 0) to access restricted ASIs (ASI<7> = 0) cause a 
privileged_action exception. 
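The restricted-ASI rule can be expressed as a short check. The following Python sketch is illustrative only; the function names are hypothetical, and it models nothing beyond the ASI<7> and PSTATE.PRIV test stated above.

```python
def asi_is_restricted(asi):
    """An 8-bit ASI is restricted when bit 7 is zero (values 0x00 through 0x7F)."""
    if not 0 <= asi <= 0xFF:
        raise ValueError("ASI must fit in 8 bits")
    return (asi >> 7) & 1 == 0


def check_asi_access(asi, pstate_priv):
    """Return the name of the exception raised by the access, or None if it is permitted."""
    if asi_is_restricted(asi) and pstate_priv == 0:
        return "privileged_action"
    return None
```

For instance, `check_asi_access(0x04, 0)` returns `"privileged_action"`, while the same ASI used with PSTATE.PRIV = 1, or ASI 0x80 used in either mode, returns `None`.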








8.3 Memory Accesses and Cacheability


Memory is logically divided into real memory (cached) and I/O memory (non-cached with
and without side-effects) spaces. Real memory spaces can be accessed without side-effects.
For example, a read from real memory space returns the information most recently written.
In addition, an access to real memory space does not result in program-visible side-effects. In
contrast, a read from I/O space may not return the most recently written information and may
result in program-visible side-effects.
8.3.1 Coherence Domains


The two types of memory operations supported in the UltraSPARC IIIi processor are
cacheable and non-cacheable accesses, as indicated by the page translation (TTE.CP,
TTE.CV) of the MMU or by an ASI override.


SPARC-V9 does not specify memory ordering between cacheable and non-cacheable
accesses. The UltraSPARC IIIi processor maintains TSO ordering between memory
references regardless of their cacheability.


8.3.1.1 Cacheable Accesses


Accesses within the coherence domain are called cacheable accesses. They have the
following properties:

- Data reside in real memory locations.
- Accesses observe supported cache coherency protocol(s).
- The unit of coherence is 64 bytes.


8.3.1.2 Non-Cacheable and Side-Effect Accesses


Accesses outside of the coherence domain are called non-cacheable accesses. Some of these
memory-mapped locations may have side-effects when accessed. They have the following
properties:

- Data might not reside in real memory locations. Accesses may result in
  programmer-visible side-effects. An example is memory-mapped I/O control registers,
  such as those in a UART.
- Accesses do not observe supported cache coherency protocol(s).
- The smallest unit in each transaction is a single byte.














Non-cacheable accesses with the TTE.E bit set (those having side-effects) are all strongly
ordered with respect to other non-cacheable accesses with the E bit set. In addition, store
compression is disabled for these accesses. Speculative loads with the E bit set cause a
data_access_exception trap (with SFSR.FT = 2, speculative load to page marked with
E bit).


Note: The TTE.E bit comes from the page translation of the MMU or an ASI override.











Non-cacheable accesses with the TTE.E bit cleared (non-side-effect accesses) are processor
consistent and obey TSO memory ordering. In particular, processor consistency ensures that
a non-cacheable load that references the same location as a previous non-cacheable store will
load the data of the previous store. Store compression is supported. See Section 8.11, “Store
Compression” for details.


Note: Side-effect, as indicated in TTE.E, does not imply non-cacheability.


8.3.2 Global Visibility


A memory access is considered globally visible when the transaction request is issued on
JBUS.


8.3.3 Memory Ordering


To ensure the correct ordering between cacheable and non-cacheable domains, explicit
memory synchronization is needed in the form of MEMBAR instructions. CODE EXAMPLE 8-1
illustrates the issues involved in mixing cacheable and non-cacheable accesses.





CODE EXAMPLE 8-1 Memory Ordering and MEMBAR Examples


Assume that all accesses go to non-side-effect memory locations.

Process A:
While (1)
{
    Store D1: data produced
1   MEMBAR #StoreStore (needed in PSO, RMO for SPARC-V9 compliance)
    Store F1: set flag
    While F1 is set (spin on flag)
        Load F1
2   MEMBAR #LoadLoad, #LoadStore (needed in RMO for SPARC-V9 compliance)
    Load D2
}

Process B:
While (1)
{
    While F1 is cleared (spin on flag)
        Load F1
2   MEMBAR #LoadLoad, #LoadStore (needed in RMO for SPARC-V9 compliance)
    Store D2
1   MEMBAR #StoreStore (needed in PSO, RMO for SPARC-V9 compliance)
    Store F1: clear flag
}
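The reason MEMBAR #StoreStore matters under PSO but not under TSO can be shown with a deliberately simplified Python model. This is a sketch under stated assumptions, not a description of any hardware: all stores are assumed to go to different addresses, PSO is modeled as allowing any visibility order among stores not separated by a barrier, and the names `possible_visibility_orders` and `flag_can_precede_data` are hypothetical.

```python
from itertools import permutations


def possible_visibility_orders(ops, model):
    """Return the set of orders in which stores may become globally visible.

    ops is a program-order list of store names, with the string "MEMBAR"
    standing for MEMBAR #StoreStore. model is "TSO" or "PSO".
    """
    groups, current = [], []
    for op in ops:
        if op == "MEMBAR":
            groups.append(current)
            current = []
        else:
            current.append(op)
    groups.append(current)

    if model == "TSO":
        # Stores become visible strictly in program order.
        return {tuple(s for g in groups for s in g)}
    if model != "PSO":
        raise ValueError("model must be 'TSO' or 'PSO'")

    # Simplified PSO: stores between two barriers may be reordered freely.
    orders = {()}
    for g in groups:
        orders = {prefix + tuple(p) for prefix in orders for p in permutations(g)}
    return orders


def flag_can_precede_data(ops, model):
    """True if some allowed order makes the flag F1 visible before the data D1."""
    return any(o.index("F1") < o.index("D1")
               for o in possible_visibility_orders(ops, model))
```

In this model, `flag_can_precede_data(["D1", "F1"], "PSO")` is true, so a consumer could observe the flag before the data. Inserting the barrier, as in `["D1", "MEMBAR", "F1"]`, or running under TSO, removes that outcome.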








8.4 Memory Synchronization


Normal loads and stores by an UltraSPARC IIIi processor are performed in order. TSO
defines how other processors may see the ordering of the loads and stores of a particular
processor. Memory synchronizations are used to force the ordering that other processors see
beyond the rules of TSO.


In some cases, memory synchronizations are required for deterministic behavior, even with
respect to the program’s own operations. This applies to memory operations outside of
normal cacheable loads and stores.


The UltraSPARC IIIi processor achieves memory synchronization through MEMBAR and
FLUSH. It provides MEMBAR (STBAR in SPARC-V8) and FLUSH instructions for explicit
control of memory ordering in program execution. MEMBAR has several variations. All
MEMBARs are implemented in one of two ways in the UltraSPARC IIIi processor:

- As a NOP
- With MEMBAR #Sync semantics


Since the processor always executes with TSO memory ordering semantics, three of the
ordering MEMBARs are implemented as NOPs. TABLE 8-1 lists the MEMBAR implementations.











TABLE 8-1 MEMBAR Semantics

MEMBAR                Semantics
#LoadLoad             NOP. All loads wait for completion of all previous loads.
#LoadStore            NOP. All stores wait for completion of all previous loads.
#Lookaside            #Sync. Wait until store buffer is empty.
#StoreStore, STBAR    NOP. All stores wait for completion of all previous stores.
#StoreLoad            #Sync. All loads wait for completion of all previous stores.
#MemIssue             #Sync. Wait until all outstanding memory accesses complete.
#Sync                 #Sync. Wait for all outstanding instructions and all deferred errors.
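The MEMBAR implementations listed in TABLE 8-1 can be captured as a lookup. The Python sketch below is illustrative; the names `MEMBAR_IMPLEMENTATION` and `membar_class` are hypothetical, mask names are written with a leading # as in SPARC assembly, and the treatment of combined masks (any #Sync-class component makes the whole MEMBAR #Sync-class) is an assumption rather than a statement from the table.

```python
# How each MEMBAR variation is implemented, per TABLE 8-1.
MEMBAR_IMPLEMENTATION = {
    "#LoadLoad": "NOP",
    "#LoadStore": "NOP",
    "#StoreStore": "NOP",   # also covers the SPARC-V8 STBAR
    "#Lookaside": "#Sync",
    "#StoreLoad": "#Sync",
    "#MemIssue": "#Sync",
    "#Sync": "#Sync",
}


def membar_class(mask_names):
    """Classify a MEMBAR with one or more mask names as 'NOP' or '#Sync'.

    Assumption: a combined mask is #Sync-class if any component is.
    """
    classes = {MEMBAR_IMPLEMENTATION[name] for name in mask_names}
    return "#Sync" if "#Sync" in classes else "NOP"
```

Under that assumption, `membar_class(["#LoadLoad", "#LoadStore"])` is `"NOP"`, while `membar_class(["#StoreStore", "#StoreLoad"])` is `"#Sync"`.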
8.4.1 MEMBAR #Sync


MEMBAR #Sync forces all outstanding instructions and all deferred errors to be completed
before any instructions after the MEMBAR are issued.


8.4.2 MEMBAR Rules


TABLE 8-2 and TABLE 8-3 summarize the cases where the programmer must insert a MEMBAR
to ensure ordering between two memory operations on the UltraSPARC IIIi processor. Use
TABLE 8-2 and TABLE 8-3 for ordering purposes only. Be sure not to confuse memory
operation ordering with processor consistency or deterministic operation; MEMBARs are
required for deterministic operation of certain ASI register updates.











Caution: The MEMBAR requirements for the UltraSPARC IIIi processor are less stringent
than the requirements of SPARC-V9. To ensure code portability across systems, use the
stronger of the MEMBAR requirements of SPARC-V9.








Read the tables as follows: read from row to column; the first memory operation in program
order (the row operation, R) is followed by the memory operation found in the column (C).
Two symbols are used as table entries:

- #  No intervening operation is required because Fireplane-compliant systems
     automatically order R before C.
- M  MEMBAR #Sync or MEMBAR #MemIssue or MEMBAR #StoreLoad is required.


When VA<12:5> of a column operation does not match VA<12:5> of a row operation and
strong ordering is desired, the MEMBAR rules summarized in TABLE 8-2 reflect the
UltraSPARC IIIi processor hardware implementation.
TABLE 8-2 MEMBAR Rules for Column VA<12:5> ≠ Row VA<12:5> While Desiring Strong Ordering

To Column Operation C (columns, left to right):
 1 load                    6 load_nc_e               11 bstore
 2 load from internal ASI  7 store_nc_e              12 bstore_commit
 3 store                   8 load_nc_ne              13 bload_nc
 4 store to internal ASI   9 store_nc_ne             14 bstore_nc
 5 atomic                 10 bload

From Row Operation R:      1  2  3  4  5  6  7  8  9 10 11 12 13 14
load                       #  #  #  #  #  #  #  #  #  M  M  #  M  M
load from internal ASI     #  #  #  #  #  #  #  #  #  #  #  #  #  #
store                      M  #  #  #  #  M  #  M  #  M  M  #  M  M
store to internal ASI      #  M  #  #  #  #  #  #  #  M  #  #  M  M
atomic                     #  #  #  #  #  #  #  #  #  M  M  #  M  M
load_nc_e                  #  #  #  #  #  #  #  #  #  M  M  #  M  M
store_nc_e                 M  #  #  #  #  #  #  M  #  M  M  #  M  M
load_nc_ne                 #  #  #  #  #  #  #  #  #  M  M  #  M  M
store_nc_ne                M  #  #  #  #  M  #  M  #  M  M  #  M  M
bload                      M  #  M  #  M  M  M  M  M  M  M  #  M  M
bstore                     M  #  M  #  M  M  M  M  M  M  M  #  M  M
bstore_commit              M  #  M  #  M  M  M  M  M  M  M  #  M  M
bload_nc                   M  #  M  #  M  M  M  M  M  M  M  #  M  M
bstore_nc                  M  #  M  #  M  M  M  M  M  M  M  #  M  M
When VA<12:5> of a column operation matches VA<12:5> of a row operation, the MEMBAR
rules summarized in TABLE 8-3 reflect the UltraSPARC IIIi processor hardware
implementation.


TABLE 8-3 MEMBAR Rules for Column VA<12:5> = Row VA<12:5> While Desiring Strong Ordering

To Column Operation C:

From Row Operation R:
load                       # # # # # # # # # #
load from internal ASI     # # # # # # # # # # # #
store                      # # # # # # # # # M # # # #
store to internal ASI      # M # # # # # # # M # # M M
atomic                     # # # # # # # # # #
load_nc_e                  # # # # # # # # # #
load_nc_ne                 # # # # # # # # # # # # # #
store_nc_ne                # # # # # # # # # M # # M #
bload                      # # # # # # # # # # # # # #
bstore                     # # # # # # # # # M # # # #
bload_nc                   # # # # # # # # # # # #
bstore_nc                  # # # # # # # # # # M #
FLUSH


FLUSH behaves like a MEMBAR with further restrictions. MEMBAR blocks execution of
subsequent instructions until all memory operations and errors are resolved. FLUSH is
similar, with the further behavior that all instruction fetch and instruction buffering
operations are also blocked.
8.5 Atomic Operations


SPARC-V9 provides three atomic instructions to support mutual exclusion:

- SWAP: Atomically exchanges the lower 32 bits in an integer register with a word in
  memory. This instruction is issued only after store buffers are empty. Subsequent loads
  interlock on earlier SWAPs.

  If a page is marked as virtually non-cacheable but physically cacheable (TTE.CV = 0 and
  TTE.CP = 1), allocation is done to the L2-cache and W-cache only. This includes all of
  the atomic-access instructions.

- LDSTUB: Behaves like a SWAP except that it loads a byte from memory into an integer
  register and atomically writes all 1’s (0xFF) into the addressed byte.

- Compare and Swap (CAS(X)A): Combines a load, compare, and store into a single
  atomic instruction. It compares the value in an integer register to a value in memory. If
  they are equal, the value in memory is swapped with the contents of a second integer
  register. If they are not equal, the value in memory is still loaded into the second integer
  register, but the register contents are not stored to memory. The L2-cache will still go
  into M-state, even if there is no store.


All of these operations are carried out atomically; in other words, no other memory
operation can be applied to the addressed memory location until the entire atomic
sequence is completed.


These instructions behave like both a load and store access, but the operation is carried out
indivisibly. These instructions can be used only in the cacheable domain (not in
non-cacheable I/O addresses).


These atomic instructions can be used with the ASIs listed in TABLE 8-4. Access with a
restricted ASI in unprivileged mode (PSTATE.PRIV = 0) results in a privileged_action trap.
Atomic accesses with non-cacheable addresses cause a data_access_exception trap (with
SFSR.FT = 4, atomic to page marked non-cacheable). Atomic accesses with unsupported
ASIs cause a data_access_exception trap (with SFSR.FT = 8, illegal ASI value or virtual
address).





TABLE 8-4 ASIs That Support SWAP, LDSTUB, and CAS

ASI Name                               Access
ASI_NUCLEUS(_LITTLE)                   Restricted
ASI_AS_IF_USER_PRIMARY(_LITTLE)        Restricted
ASI_AS_IF_USER_SECONDARY(_LITTLE)      Restricted
ASI_PRIMARY(_LITTLE)                   Unrestricted
ASI_SECONDARY(_LITTLE)                 Unrestricted
ASI_PHYS_USE_EC(_LITTLE)               Restricted
Note: Atomic accesses with non-faulting ASIs are not allowed, because the latter have the
load-only attribute.
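The register and memory effects of the three atomic instructions can be sketched in Python. This is a functional model only: memory is a hypothetical dictionary keyed by address, indivisibility and cache state are not modeled, and the function names, including the `try_lock` helper, are illustrative.

```python
def swap(mem, addr, reg):
    """SWAP: exchange the lower 32 bits of a register with a word in memory.

    Returns the old memory word (the new register value).
    """
    old = mem.get(addr, 0)
    mem[addr] = reg & 0xFFFFFFFF
    return old


def ldstub(mem, addr):
    """LDSTUB: return the addressed byte and write all ones (0xFF) to it."""
    old = mem.get(addr, 0)
    mem[addr] = 0xFF
    return old


def cas(mem, addr, compare_reg, swap_reg):
    """Compare and swap: store swap_reg only if memory equals compare_reg.

    The old memory value is returned in either case; it is what the
    second integer register holds afterwards.
    """
    old = mem.get(addr, 0)
    if old == compare_reg:
        mem[addr] = swap_reg
    return old


def try_lock(mem, addr):
    """A lock byte is free when zero; LDSTUB acquires it if it was free."""
    return ldstub(mem, addr) == 0
```

A first `try_lock` on a zero byte succeeds and leaves 0xFF behind, so a second attempt fails until the holder stores zero again.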








8.6 Non-Faulting Load 


A non-faulting load behaves like a normal load, with the following exceptions: 








- It does not allow side-effect access. An access with the TTE.E bit set causes a
  data_access_exception trap (with SFSR.FT = 2, speculative load to page marked E bit).

- It can be applied to a page with the TTE.NFO (non-fault access only) bit set; other types
  of accesses cause a data_access_exception trap (with SFSR.FT = 0x10, normal access to
  page marked NFO).


These loads are issued with ASI_PRIMARY_NO_FAULT(_LITTLE) or
ASI_SECONDARY_NO_FAULT(_LITTLE). A store with a NO_FAULT ASI causes a
data_access_exception trap (with SFSR.FT = 8, illegal RW).

















When a non-faulting load encounters a TLB miss, the operating system should attempt to 
translate the page. If the translation results in an error, then zero is returned and the load 
completes silently. 


Typically, optimizers use non-faulting loads to move loads across conditional control 
structures that guard their use. This technique potentially increases the distance between a 
load of data and the first use of that data, in order to hide latency. The technique allows more 
flexibility in code scheduling and improves performance in certain algorithms by removing 
address checking from the critical code path. 


For example, when following a linked list, non-faulting loads allow the null pointer to be
accessed safely in a speculative, read-ahead fashion; the page at virtual address 0x0 can safely
be accessed with no penalty. The NFO bit in the MMU marks pages that are mapped for safe
access by non-faulting loads, but that can still cause a trap by other, normal accesses.


Thus, programmers can trap on wild pointer references (many programmers count on an
exception being generated when accessing address 0x0 to debug code) while benefiting from
the acceleration of non-faulting access in debugged library routines.
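The behavior described in this section can be summarized with a toy Python model. Everything here is a simplification for illustration: `pages` is a hypothetical dictionary from page number to attribute flags, `memory` is a dictionary from address to value, an 8 KB page size is assumed, and the operating system's TLB-miss handling is reduced to "return zero when no translation exists".

```python
PAGE_BYTES = 8192  # assumed page size for this sketch


class DataAccessException(Exception):
    """Stands in for the data_access_exception trap."""


def nf_load(pages, memory, va):
    """Non-faulting load: zero on an untranslatable page, trap on an E-bit page."""
    page = pages.get(va // PAGE_BYTES)
    if page is None:
        return 0  # translation failed; the load completes silently with zero
    if page.get("E"):
        raise DataAccessException("SFSR.FT = 2")
    return memory.get(va, 0)


def normal_load(pages, memory, va):
    """Normal load: traps on a page marked NFO."""
    page = pages.get(va // PAGE_BYTES)
    if page is None:
        raise LookupError("unmapped page")
    if page.get("NFO"):
        raise DataAccessException("SFSR.FT = 0x10")
    return memory.get(va, 0)


def raises(fn, *args):
    """Return True if calling fn(*args) raises DataAccessException."""
    try:
        fn(*args)
    except DataAccessException:
        return True
    return False
```

With page 0 marked NFO, `nf_load` at address 0 returns a value without trapping, while `normal_load` at the same address raises, which is the split between debugged library code and wild-pointer detection described above.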
8.7 Prefetch Instructions


The UltraSPARC IIIi processor implements all SPARC-V9 prefetch instructions except for
prefetch page. All prefetches check the L2-cache before issuing a system request for the
requested data. Prefetch instructions are a performance feature. Prefetch instructions do not
change the underlying memory model and do not have any effect from an architectural
standpoint.


TABLE 8-5 describes prefetch instructions.









































TABLE 8-5 Types of Software Prefetch Instructions

fcn Value  Prefetch                  Prefetch (64 bytes of        Instruction  Request Exclusive
(hex)      Instruction Type          data) into:                  Strength     Ownership
00         Prefetch read many        P-cache and L2-cache         Weak         No
01         Prefetch read once        P-cache only                 Weak         No
02         Prefetch write many       L2-cache only                Weak         Yes
03         Prefetch write once (1)   L2-cache only                Weak         No
04         Reserved                  Undefined
05-0F      Reserved                  Undefined
10         Prefetch invalidate       Invalidates a P-cache        N/A
                                     line, no data is
                                     prefetched.
11-13      Reserved                  Undefined
14         Same as fcn = 00                                       Weak (2)     No
15         Same as fcn = 01                                       Weak (2)     No
16         Same as fcn = 02                                       Weak (2)     Yes
17         Same as fcn = 03                                       Weak (2)     No
18-1F      Reserved                  Undefined























1. Although the name is “prefetch write once,” the actual use is prefetch to L2-cache for a future read. 


2. These weak instructions may be implemented as strong in future implementations. 
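The fcn encodings in TABLE 8-5 can be decoded with a small lookup. The Python sketch below is illustrative only; the names `PREFETCH_FCN` and `decode_prefetch` are hypothetical, reserved encodings decode to `None`, and exclusive ownership for prefetch invalidate is recorded as `None` because the table lists it as not applicable.

```python
# fcn -> (instruction type, destination, requests exclusive ownership)
PREFETCH_FCN = {
    0x00: ("read many", "P-cache and L2-cache", False),
    0x01: ("read once", "P-cache only", False),
    0x02: ("write many", "L2-cache only", True),
    0x03: ("write once", "L2-cache only", False),
    0x10: ("invalidate", "none (invalidates a P-cache line)", None),
}


def decode_prefetch(fcn):
    """Decode a 5-bit prefetch fcn value; returns None for reserved encodings."""
    if not 0 <= fcn <= 0x1F:
        raise ValueError("fcn is a 5-bit field")
    if 0x14 <= fcn <= 0x17:
        fcn -= 0x14  # 14-17 behave the same as 00-03
    return PREFETCH_FCN.get(fcn)
```

For example, `decode_prefetch(0x16)` returns the same entry as `decode_prefetch(0x02)`, the only variant that requests exclusive ownership.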
8.8 Block Loads and Stores


Block load and store instructions work like normal floating-point load and store instructions, 
except that the data size (granularity) is 64 bytes per transfer. 


Block loads and stores do not obey TSO. They do not even obey the processor’s consistency 
rules without the correct use of MEMBAR. Section A.4 “Block Load and Block Store (VIS I)” 
on page A-274 discusses block loads and stores in detail. 
8.9 I/O and Accesses with Side-Effects


I/O locations might not behave with memory semantics. Loads and stores could have side- 
effects; for example, a read access could clear a register or pop an entry off a FIFO. A write 
access could set a register address port so that the next access to that address will read or 
write a particular internal register. Such devices are considered order sensitive. Also, such 
devices may only allow accesses of a fixed size, so store merging of adjacent stores or stores 
within a 16-byte region would cause an error. 





The UltraSPARC IIIi MMU includes an attribute bit in each page translation, TTE.E, which
when set signifies that this page has side-effects. Accesses other than block loads or stores to
pages that have this bit set exhibit the following behavior:

- Non-cacheable accesses are strongly ordered with respect to each other.

- Non-cacheable loads with the E bit set will not be issued to the system until all previous
  control transfers are resolved.

- Non-cacheable store compression is disabled for E bit accesses.

- Exactly those E bit accesses implied by the program are made in program order.

- Non-faulting loads are not allowed and cause a data_access_exception (with
  SFSR.FT = 2, speculative load to page marked E bit).

- For portability across SPARC-V9 processors, a MEMBAR may be needed between
  side-effect and non-side-effect accesses while in PSO and RMO modes, as well as in some
  cases of TSO.
8.9.1 Instruction Prefetch to Side-Effect Locations


The processor does instruction prefetching and follows branches that it predicts are taken.
Addresses mapped by the I-MMU can be accessed even though they are not actually
executed by the program. Normally, locations with side-effects or that generate timeouts or
bus errors are not mapped by the I-MMU; therefore, prefetching will not cause problems.


When running with the I-MMU disabled, software must avoid placing data in the path of a
control transfer instruction target or sequentially following a trap or conditional branch
instruction. Data can be placed sequentially following the delay slot of a BA, BPA (p = 1),
CALL, or JMPL instruction. Instructions should not be placed closer than 256 bytes to
locations with side-effects.


8.9.2 Instruction Prefetch Exiting Red State


Exiting RED_state by writing zero to PSTATE.RED in the delay slot of a JMPL
instruction is not recommended. A non-cacheable instruction prefetch may be made to the
JMPL target, which may be in a cacheable memory area. This situation can result in a bus
error on some systems and can cause an instruction_access_error trap. Programmers can mask
the trap by setting the NCEEN bit in the L2-cache Error Enable Register to zero, but doing so
will mask all non-correctable error checking. Exiting RED_state with DONE, RETRY, or
with the destination of the JMPL non-cacheable will avoid the problem.












































8.10 Internal ASIs


ASIs in the ranges 0x30-0x6F and 0x72-0x7F are used for accessing internal states. Stores to
these ASIs do not follow the normal memory-model ordering rules. Correct operation can be
assured by adhering to the following requirements:

- A MEMBAR #Sync is needed after a store to an internal ASI other than MMU ASIs before
  the point that side-effects must be visible. This MEMBAR must precede the next load or
  non-internal store. To avoid data corruption, the MEMBAR must also occur before the delay
  slot of a delayed control transfer instruction of any type.

  Alternatively, a MEMBAR #Sync could be inserted at the beginning of any vulnerable trap
  handler. “Vulnerable” trap handlers are those which contain one or more LDXAs from any
  internal ASI (ASIs 0x30-0x6F, 0x72-0x77, and 0x7A-0x7F). However, this may cause an
  unacceptable performance reduction in some trap handlers, so this is not the preferred
  alternative.

- A FLUSH, DONE, or RETRY is needed after a store to an internal I-MMU ASI
  (ASI 0x50-0x52, 0x54-0x5F), an I-cache ASI (0x66-0x6F), or the IC bit in the DCU Control
  Register, prior to the point that side-effects must be visible. A store to D-MMU registers
  other than the context ASIs can use a MEMBAR #Sync. To avoid data corruption, the
  MEMBAR must also occur before the delay slot of a delayed control transfer instruction of
  any type.

- If the store is to an I-MMU state register (ASI = 0x50, virtual address = 0x18), then the
  FLUSH, DONE, or RETRY must immediately follow the store. Furthermore, one of the
  following must be true, to prevent an intervening I-TLB miss from causing stale data to be
  stored:

  - The code must be locked down in the I-TLB, or
  - The store and the subsequent FLUSH, DONE, or RETRY should be kept on the same
    8 KB page of instruction memory.
8.11 Store Compression


Consecutive non-side-effect, non-cacheable stores can be combined into aligned 16-byte
entries in the store buffer to improve store bandwidth. Cacheable stores will naturally
coalesce in the W-cache rather than be compressed in the store buffer. Non-cacheable stores
can be compressed only with adjacent non-cacheable stores. To maintain strong ordering for
I/O accesses, stores with the side-effect attribute (E bit set) cannot be combined with any
other stores.


A 16-byte non-cacheable merge buffer is used to coalesce adjacent non-cacheable stores.
Non-cacheable stores will continue to coalesce into the 16-byte buffer until one of the
following conditions occurs:

- The data is pulled from the non-cacheable merge buffer by the target device.

- The store overwrites a previously written entry (a valid bit is kept for each of the
  16 bytes).

  Caution: This behavior is unique to the UltraSPARC IIIi processor and differs from
  previous UltraSPARC processor implementations.

- The store is not within the current address range of the merge buffer (within the 16-byte
  aligned merge region).

- The store is a cacheable store.

- The store is to a side-effect page.

- MEMBAR #Sync
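The coalescing conditions listed above can be exercised with a small Python model of a 16-byte merge buffer. This is a sketch under stated assumptions: the class name `MergeBuffer` and its methods are hypothetical, the case where the target device pulls data from the buffer is not modeled, and a flushed buffer is recorded simply as a dictionary of address to byte value.

```python
class MergeBuffer:
    """Toy model of a 16-byte non-cacheable store merge buffer."""

    SIZE = 16

    def __init__(self):
        self.base = None                  # aligned base address of the merge region
        self.valid = [False] * self.SIZE  # one valid bit per byte
        self.data = bytearray(self.SIZE)
        self.flushed = []                 # history of buffers sent out

    def flush(self):
        """Send the current contents out and empty the buffer."""
        if self.base is not None:
            entry = {self.base + i: self.data[i]
                     for i in range(self.SIZE) if self.valid[i]}
            self.flushed.append(entry)
        self.base = None
        self.valid = [False] * self.SIZE

    def store(self, addr, payload, cacheable=False, side_effect=False):
        """Apply a store; returns True if it was placed in the merge buffer."""
        if cacheable or side_effect:
            self.flush()                  # these stores end coalescing
            return False
        base = addr & ~(self.SIZE - 1)
        first = addr - base
        if first + len(payload) > self.SIZE:
            raise ValueError("store crosses a 16-byte aligned region")
        offsets = range(first, first + len(payload))
        outside = self.base is not None and base != self.base
        overwrite = self.base == base and any(self.valid[i] for i in offsets)
        if outside or overwrite:
            self.flush()
        self.base = base
        for i, byte in zip(offsets, payload):
            self.data[i] = byte
            self.valid[i] = True
        return True

    def membar_sync(self):
        """MEMBAR #Sync ends coalescing."""
        self.flush()
```

Two adjacent byte stores to the same aligned region merge into one entry; a later store that overwrites an already valid byte, or that falls in a different region, first pushes the accumulated entry out.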
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8.12 


8.12.1 


Read After Write (RAW) Bypassing 


Load data can be bypassed from previous stores before they become globally visible (data for 
load from the store queue). This is specifically allowed by the TSO memory model. Data for 
all types of loads cannot be bypassed from all types of stores. 


All types of load instructions can get data from the store queue, except the following load 
instructions: 

Signed loads (1dsb, 1dsh, ldsw) 

Atomics 

Load double to integer register file (1dd) 

Quad loads to integer register file 

Load from FSR register 

Block loads 

Short floating-point loads 


Loads from internal ASIs 


All types of store instructions can give data to a load, except the following store instructions: 
Floating-point partial stores 
Store double from integer register file (std) 
Store part of atomic 
Short FP stores 
Stores to pages with side-effect bit set 


Stores to non-cacheable pages 


RAW Bypassing Algorithm 


The algorithm used in the UltraSPARC IIi processor for RAW bypassing is as follows: 


if ( (Load/store access the same physical address) and
     (Load/store endianness is the same) and
     (Load/store size is the same) and
     (Load can get its data from the store queue) and
     (Store in the store queue can give its data to a load) and
     (Load hits in either D-cache or P-cache)
   )
then
    Load will get its data from store queue
else
    Load will get its data from the memory system
endif
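The bypass decision above can be sketched as a small predicate. This is only an illustrative model: the `Access` record, the string labels for load and store kinds, and the function name `raw_bypass` are assumptions made for the example and do not correspond to any hardware encoding.

```python
from dataclasses import dataclass

# Kinds the text lists as excluded from bypassing.
# The labels are hypothetical names chosen for this sketch.
NO_BYPASS_LOADS = {"signed", "atomic", "ldd", "quad", "fsr",
                   "block", "short_fp", "internal_asi"}
NO_BYPASS_STORES = {"fp_partial", "std", "atomic", "short_fp",
                    "side_effect", "non_cacheable"}


@dataclass
class Access:
    kind: str                    # e.g. "unsigned" load or "plain" store
    pa: int                      # physical address
    size: int                    # access size in bytes
    little_endian: bool = False


def raw_bypass(load: Access, store: Access, load_hits_d_or_p_cache: bool) -> bool:
    """Return True if the load would take its data from the store queue."""
    return (
        load.pa == store.pa
        and load.little_endian == store.little_endian
        and load.size == store.size
        and load.kind not in NO_BYPASS_LOADS
        and store.kind not in NO_BYPASS_STORES
        and load_hits_d_or_p_cache
    )
```

Every condition must hold at once; if any one fails, the load is served by the memory system instead.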


RAW Detection Algorithm 


When data for a load cannot be bypassed from previous stores before they become globally 
visible (store data is not yet retired from the store queue), the load is recirculated after the 
RAW hazard is removed. The following conditions can cause this recirculation: 


Load data can be bypassed from more than one store in the store queue.

The load's VA<12:0> overlaps a store's VA<12:0> and store data cannot be bypassed from
the store queue.

The load's VA<12:5> matches a store's VA<12:5> and the load misses the D-cache.

Load is from side-effect page (page attribute E = 1) when the store queue contains one or
more stores to side-effect pages.
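The two address comparisons in the conditions above use only the low virtual address bits, so accesses to different pages can still be flagged as hazards. The sketch below illustrates this. It assumes that "overlaps" means the byte ranges of the two accesses intersect when only VA<12:0> is considered, and it ignores wrap-around at the 8 KB boundary; the function names are hypothetical.

```python
def va_field(va: int, hi: int, lo: int) -> int:
    """Extract VA<hi:lo> as an integer."""
    return (va >> lo) & ((1 << (hi - lo + 1)) - 1)


def overlaps_va12_0(load_va: int, load_size: int,
                    store_va: int, store_size: int) -> bool:
    """Byte-range overlap using only VA<12:0> of each access (assumed reading)."""
    l0 = va_field(load_va, 12, 0)
    s0 = va_field(store_va, 12, 0)
    return l0 < s0 + store_size and s0 < l0 + load_size


def matches_va12_5(load_va: int, store_va: int) -> bool:
    """True if both accesses share the same VA<12:5> (32-byte line index bits)."""
    return va_field(load_va, 12, 5) == va_field(store_va, 12, 5)
```

For example, a 4-byte load at VA 0x2004 and an 8-byte store at VA 0x4000 lie in different pages but overlap in VA<12:0>, so under this reading they would be treated as a potential RAW hazard.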
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CHAPTER 9 





Caches and Coherency 





This chapter describes the use of caches and TLBs, and contains the following sections: 
Cache Organization 
Cache Flushing 
Controlling P-Cache 
Translation Lookaside Buffers (TLBs) 





9.1 Cache Organization


9.1.1 Virtually Indexed, Physically Tagged Caches (VIPT)


The D-cache is Virtually Indexed, Physically Tagged (VIPT). Virtual addresses are used to
index into the cache tag and data arrays while accessing the D-MMU (that is, D-TLBs). The
resulting tag is compared against the translated physical address to determine a cache hit.


A side-effect inherent in a virtual-indexed cache is address aliasing. This issue is addressed
in Section 9.2.1 “Address Aliasing Flushing” on page 206.


9.1.1.1 Data Cache (D-Cache)


The Data Cache is a write-through, non-allocating on a write miss, 64 KB, pseudo-4-way 
associative cache with a 32-byte line. 


Data accesses bypass the data cache when:


The Data Cache enable (DC) bit in the Data Cache Unit Control Register (DCUCR) is
clear, or

The D-MMU Enable (DCUCR.DM) bit and the virtual cacheability (DCUCR.CV) bit are
clear, or

The access is mapped by the D-MMU as non-virtual-cacheable


Note — A non-virtual-cacheable access may access data in the Data Cache from an earlier 
cacheable access to the same physical block, unless the Data Cache is disabled. Software 
must flush the Data Cache when changing a physical page from cacheable to non-cacheable 
(see Section 9.2 “Cache Flushing” on page 205). 


Bypassing the D-Cache 


The D-cache can return stale data if CP == 1, CV == 0 is used to bypass the cache for loads
and stores to a particular address after CP == 1, CV == 1 has been used for that address.


The D-cache should be flushed after mixing any CP/CV settings for a physical address,
including cacheable (DRAM) and non-cacheable (I/O) physical addresses.


The term “virtually non-cacheable” refers to the “non-D-cacheable” CP == 1, CV == 0 case,
as opposed to the more common use of “non-cacheable” to describe I/O or graphics related
physical addresses.


CP == 1, CV == 1: Cacheable, Virtually-cacheable

CP == 1, CV == 0: Cacheable, Virtually-non-cacheable (ASI_PHYS_USE_EC has this effect)

CP == 0, CV == 1: P-cacheable

CP == 0, CV == 0: Non-cacheable
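The four CP/CV combinations and the flush rule stated above can be written down as a small lookup. This is a minimal sketch; the function names are hypothetical, and `needs_dcache_flush` simply encodes the rule that the D-cache should be flushed whenever the CP/CV setting used for a physical address changes.

```python
CACHEABILITY = {
    (1, 1): "Cacheable, Virtually-cacheable",
    (1, 0): "Cacheable, Virtually-non-cacheable",
    (0, 1): "P-cacheable",
    (0, 0): "Non-cacheable",
}


def cacheability(cp: int, cv: int) -> str:
    """Return the label the text gives to a (CP, CV) pair."""
    return CACHEABILITY[(cp, cv)]


def needs_dcache_flush(old_cp_cv: tuple, new_cp_cv: tuple) -> bool:
    """Mixing any two different CP/CV settings for one physical address calls for a flush."""
    return old_cp_cv != new_cp_cv
```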














Only two indexes in the D-cache need to be flushed for each 32-byte aligned physical 
address: 


{VA[13] == 0,PA[12:5]} and 
{VA[13] == 1,PA[12:5]} 
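The two indexes can be computed directly from the physical address. The sketch below assumes the 9-bit D-cache index is the concatenation {VA[13], PA[12:5]} exactly as written above (64 KB, 4 ways, 32-byte lines gives 512 sets); the function name is hypothetical.

```python
def dcache_flush_indexes(pa: int) -> list:
    """Return the two D-cache set indexes that may hold a 32-byte line at `pa`."""
    line = (pa >> 5) & 0xFF                      # PA[12:5]
    return [(va13 << 8) | line for va13 in (0, 1)]
```

For a physical address of 0x12345678, PA[12:5] is 0xB3, so the candidate indexes are 0x0B3 and 0x1B3.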


Special Case 1 














When performing a load with a physical address, using ASI = 0x14 (ASI_PHYS_USE_EC),
causing CP == 1 and CV == 0, and the address hits in the D-cache, the following describes
how the data comes from D-cache instead of L2-cache:


If CP == 0 and CV == 0, which indicates a “non-cacheable” access, and the address is in the 
D-cache, data can be returned from the D-cache. 


The address should be flushed from the D-cache before changing its mapping. 
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9.1.2.2 


9.1.3 


9.1.3.1 


Similarly, if CP == 1, and CV == 0, and the data is in the D-cache, data may be returned 
from the D-cache. However, there are corner cases where it may not be the case. 





For instance, with ASI_PHYS_USE_EC, the physical PA[13] is used to index the D-cache, 
where VA[13] would ordinarily be used. Therefore, the data might not be correctly returned 
if the real data was in VA[13] == 0, but PA[13] == 1. Ordinarily the rest of the PA bits will 
have a difference, therefore, it will miss in the D-cache, and go to the L2-cache correctly. 
This takes advantage of knowing that a valid PA can only exist in one VA[13] mapping at a 
time in the D-cache. 











This depends on how the addresses were mapped earlier, when the line was installed in the 
D-cache. 





The behavior of an ASI_PHYS_USE_EC load that hits in the D-cache is not defined or tested, so
software should not rely on it.











Special Case 2 





When performing a store with a physical address, using ASI = 0x14 (ASI_PHYS_USE_EC),
causing CP == 1 and CV == 0, and the address hits in the D-cache, the following describes
how the D-cache gets updated:











The software should make sure the physical address is not in the D-cache, before accessing 
that address using CP == 1, CV == 0, whether by a TLB mapping, or using one of the special 
ASIs. 


Physically-Indexed, Physically-Tagged Caches (PIPT) 


Instruction Cache (I-Cache) 


The Instruction Cache is a 32KB pseudo 4-way, set-associative, write-invalidate cache with 
32-byte lines. Instruction fetches bypass the Instruction Cache when: 


The Instruction Cache enable (DCUCR.IC) is clear, or


The I-MMU enable (DCUCR.IM) bit and the physical cacheability (DCUCR.CP) bit are
clear, or





The processor is in RED_state, or 


The fetch is mapped by the I-MMU as nonphysical-cacheable. 


The Instruction Cache snoops stores from other processors and DMA transfers, as well as
stores and block commit stores from the same processor.
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The FLUSH instruction is not required to maintain coherency. Stores and block store commits 
invalidate the Instruction Cache, but do not flush instructions that have already been 
prefetched into the pipeline. A FLUSH, DONE, or RETRY instruction can be used to flush 
the pipeline. 














If a program changes the I-cache mode from I-cache-OFF to I-cache-ON, then the next
instruction fetch always causes an I-cache miss, even if it would otherwise hit. This rule
applies even when the DONE instruction turns on the I-cache by changing the processor state
from RED_state to normal mode. For example,








(in RED_state)

    setx  0x37e0000000007, %g1, %g2
    stxa  %g2, [%g0] 0x45       // Turn on I-cache when processor
                                // returns to normal mode.
    done                        // Escape from RED_state.

(back to normal mode)

    nop                         // 1st instruction; this always causes an I-cache miss.


9.1.3.2 Prefetch Cache (P-Cache)


The P-cache is a write-invalidate, 2 KB, 4-way associative cache with a 64-byte line and two 
32-byte sub-blocks. It is physically-indexed and physically-tagged and never contains 
modified data. The P-cache only needs to be flushed for error handling. 


The PREFETCH instruction with fcn = 16 can be used to invalidate (flush) a P-cache entry, for
example after prefetched non-cacheable data has been loaded into registers from the P-cache.


The P-cache is globally invalidated if any of the following conditions occur:

Context registers are written. 

Demap operation in the D-MMU 

D-MMU is turned on or off. 


Individual lines are invalidated on any of the following conditions: 
A store hits 


An external snoop hit 








Use of software prefetch invalidate function. (PREFETCH with fcn = 16) 
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9.1.4 


The P-cache is used for software prefetch instructions as well as for autonomous hardware
prefetch from the L2-cache. This cache never needs to be flushed (not even for address
aliases).


Second Level and Write Caches (L2-Cache, W-Cache) 


The on-chip L2-cache¹ and the W-cache are physically-indexed, physically-tagged (PIPT).
These caches have no references to virtual address and context information. The operating
system needs no knowledge of such caches after initialization, except for stable storage
management and error handling.


The L2-Cache is a 1 MB unified, write-back, write-allocate, 4-way set associative cache with 
64-byte lines. The L2-cache does not include the contents of the Instruction Cache, Prefetch 
Cache and Data Cache. The replacement policy is pseudo-random. The L2-cache cannot be 
disabled by software. 


It is necessary to flush the L2-cache for stable storage. 


Instruction fetches bypass the L2-cache when any of the following occurs:


The I-MMU is disabled and the CP bit in the Data Cache Unit Control Register is not
set.

The processor is in RED_state.

The access is mapped by the I-MMU as nonphysical-cacheable.


Data accesses bypass the L2-cache if the D-MMU enable bit in the DCU Control Register is
clear, or if the access is mapped by the D-MMU as non-physical-cacheable (unless
ASI_PHYS_USE_EC is used).











The system must provide a non-cacheable scratch memory for use by boot code until the
MMUs are enabled.


Block loads and block stores, which load or store a 64-byte block of data from memory to 
the floating-point register file, do not allocate into the L2-cache, in order to avoid pollution. 
Prefetch Read Once instructions, which load a 64-byte block of data into the P-cache, do not 
allocate into the L2-cache. 


The W-cache is a 2 KB, 4-way associative cache with 64-byte lines and 32-byte sub-blocks.
The W-cache is included in the L2-cache, and flushing the L2-cache ensures that the
W-cache has also been flushed.
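The cache geometries quoted in this section determine how many sets each cache has and how many index bits it needs. The following sketch tabulates them. It treats the "pseudo-4-way" caches as plain 4-way caches, which is a simplifying assumption, and the dictionary and function names are hypothetical.

```python
# name: (size in bytes, ways, line size in bytes), as quoted in the text
CACHES = {
    "D-cache":  (64 * 1024, 4, 32),
    "I-cache":  (32 * 1024, 4, 32),
    "P-cache":  (2 * 1024, 4, 64),
    "W-cache":  (2 * 1024, 4, 64),
    "L2-cache": (1024 * 1024, 4, 64),
}


def num_sets(name: str) -> int:
    size, ways, line = CACHES[name]
    return size // (ways * line)


def index_bits(name: str) -> int:
    """Number of address bits needed to select a set (sets are a power of two)."""
    return num_sets(name).bit_length() - 1
```

Under these assumptions the L2-cache has 4096 sets and therefore 12 index bits, and the D-cache has 512 sets and 9 index bits.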


1. L2-cache and Embedded Cache (E-cache) are used interchangeably. 
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L2-Cache Replacement Policy 





The victim way is determined by a 5-bit Linear Feedback Shift Register (LFSR). CODE EXAMPLE 9-1
reflects the cache replacement algorithm when all four ways of the L2-cache are active. The
selection is more complicated when some of the ways are blocked using EC_block; that case is
not shown here.


CODE EXAMPLE 9-1 L2-Cache Replacement Policy 


module lfsr (rand_out, event_in, reset, clk);

output [3:0] rand_out;
input event_in;
input reset;
input clk;
wire [4:0] lfsr_reg;


dffe #(5) ff_lfsr (lfsr_reg, lfsr_in, ~reset, 





event_in, 


clk); 


// 01010 is the non-reachable state for this implementation. 


wire [4:0] lfsr_in = (-lfsr reg[0], 


^ 


lfsr reg[0] 
lfsr reg[3], 


lfsr reg[0] 


^ 


lfsr reg[0] 


// update on reads that miss the L2-cache 


assign event in = ec lt cs r dl & -ec lt we r 


^lt ec hit miss dl; 


dffire #(5) f lfsr (lfsr reg, lfsr in, reset, 


assign rand out = { lfsr reg[1] & lfsr reg[0], 


lfsr reg[1] & ~lfsr_reg[0] 
^lfsr reg[1] & lfsr reg[0], 








^lfsr reg[1] & -lfsr reg[0] 


endmodule 


lfsr reg[4], 


lfsr reg[2], 
lfsr reg[1]); 


dl & 


event in, 


, 


); 
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clk); 


9.1.6 L2-Cache Locking


Networking applications get a performance boost if the interrupt code is resident in the
L2-cache. To give software guaranteed latency to certain critical data and instructions, the
UltraSPARC IIIi processor supports way blocking; that is, software can enable or disable a
way from taking part in the replacement strategy. Software can initialize a way with L2-cache
diagnostic writes and then exclude that way from the replacement algorithm.


Software can still flush a particular line in the L2-cache, even if it is locked, by using
ASI_ECACHE_FLUSH.

















Note — If software blocks all four ways of the L2-cache, then the ECU will behave as if 
only way 0 is blocked. 
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Cache Flushing 


Data in the write-invalidate or write-through caches can be flushed by invalidating the entry 
in the cache. Modified data in the L2-cache and W-cache must be written back to memory 
when flushed. 


Cache flushing is required in the following cases: 


A D-cache flush is needed when a physical page is changed from (virtually) cacheable to
(virtually) non-cacheable, or when an illegal address alias is created (see Section 9.2.1
"Address Aliasing Flushing" on page 206). This is done using ASI 0x42,
ASI_DCACHE_INVALIDATE, which specifies a physical address to flush, as for a
system bus snoop.


An L2-cache flush is needed for stable storage. This is done with either
ASI_ECACHE_FLUSH or a store with ASI_BLK_COMMIT. Flushing the L2-cache will
flush the corresponding blocks from the W-cache. See Section 9.2.2 "Committing Block
Store Flushing" on page 206.














L2-cache, D-cache, prefetch cache, and I-cache flushes may be required when an ECC 
error occurs on a read from the memory or the L2-cache. When an ECC error occurs, 
invalid data may be written into one of the caches and the cache lines must be flushed to 
prevent further corruption of data. 


Note — When flushing a single 64-byte line, with a given PA, there are sixteen locations that 
must be flushed in the D-cache. This is because it has 32-byte lines (two places), one VA 
index bit (two places), and the PA can simultaneously exist in all four ways of a set (four 
places). 
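The count of sixteen in the note above is the product 2 x 2 x 4, which can be enumerated explicitly. The sketch below assumes the D-cache set index is {VA[13], PA[12:5]} and numbers the ways 0 to 3; both the function name and the (way, index) tuple representation are hypothetical.

```python
def dcache_flush_locations(pa: int) -> list:
    """List the (way, set index) pairs that may hold part of the 64-byte line at `pa`."""
    base = pa & ~0x3F                            # align to the 64-byte line
    locations = []
    for sub_line in (0, 32):                     # two 32-byte D-cache lines
        line = ((base + sub_line) >> 5) & 0xFF   # PA[12:5]
        for va13 in (0, 1):                      # one virtual index bit
            for way in range(4):                 # four ways per set
                locations.append((way, (va13 << 8) | line))
    return locations
```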
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9.2.1 


9.2.2 
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Address Aliasing Flushing 


A side-effect inherent in a virtual-indexed cache is illegal address aliasing. Aliasing occurs 
when multiple virtual addresses map to the same physical address. 





Caution — Since the D-cache is indexed with the virtual address bits and is larger than the 
minimum page size, it is possible for the different aliased virtual addresses to end up in 
different cache blocks. Such aliases are illegal because updates to one cache block will not 
be reflected in aliased cache blocks. (There are corner cases where the same cache block can 
end up in different ways, within the same set (index); the hardware will update all ways 
within a set that have the line.) 





Normally, software avoids illegal aliasing by forcing aliases to have the same address bits 
(virtual color) up to an alias boundary. The minimum alias boundary is 16 KB. 
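The alias rule can be checked with a small calculation. The sketch assumes an 8 KB minimum page size together with the 16 KB alias boundary given above, so the virtual color reduces to VA[13]; the constant and function names are hypothetical.

```python
ALIAS_BOUNDARY = 16 * 1024   # minimum alias boundary from the text
PAGE_SIZE = 8 * 1024         # assumed minimum page size


def virtual_color(va: int) -> int:
    """Page-granular position of `va` within one alias boundary."""
    return (va % ALIAS_BOUNDARY) // PAGE_SIZE


def is_legal_alias(va_a: int, va_b: int) -> bool:
    """Two mappings of one physical page are legal aliases if their colors match."""
    return virtual_color(va_a) == virtual_color(va_b)
```

For example, virtual addresses 0x0000 and 0x4000 have the same color and may alias legally, while 0x0000 and 0x2000 do not.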


When the alias boundary is violated, software must flush the D-cache if the page was 
virtually cacheable. In this case, only one mapping of the physical page can be allowed in the 
D-MMU at a time. 


Alternatively, software can turn off the virtual caching of illegally aliased pages. This allows
multiple mappings of the alias to be in the D-MMU and avoids flushing the D-cache each time
a different mapping is referenced.


Note — A change in virtual color when allocating a free page does not require a D-cache 
flush, because the D-cache is write through. 





Committing Block Store Flushing 


Stable storage must be implemented by software cache flush. Examples of stable storage are 
battery-backed memory and a transaction log. Data which is present and modified in the 
L2-cache or the W-cache must be written back to the stable storage. 





Two ASIs (ASI_BLK_COMMIT_PRIMARY and ASI_BLK_COMMIT_SECONDARY)
perform these write backs efficiently when software can ensure exclusive write access to the
block being flushed. These ASIs write back the data from the floating-point registers to
memory and invalidate the entry in the cache. The data in the floating-point registers must
first be loaded by a block load instruction. A MEMBAR #Sync instruction can be used to
ensure that the flush is complete.
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9.2.3 L2-Cache Flushing 


L2-cache flushing may also be accomplished by ASI loads (ASI_ECACHE_FLUSH). This is
done by reading a range of addresses that map to the corresponding cache line in a particular
way being flushed, forcing out modified entries in the local cache. The load ASI physical
address will be the same as its virtual address, and will cause a miss if the line it is intended
to replace is in a valid state (M/O/E/S) in the L2-cache. If the line is modified (M/O), the
data will also be forced out to memory. The hardware will guarantee a read miss to the way
accessed by the ASI even if there is a hit in any of the other ways. The fetched line will be
installed in the Invalid state (I) in the L2-cache.

















Note — Diagnostic ASI accesses to the L2-cache can be used to invalidate a line, but they 
are not an alternative to above type of flushing. Modified data in the L2-cache will not be 
written back to memory using these Diagnostic ASI accesses (these are destructive flushes). 














An L2-cache flush operation is performed by accessing ASI 0x4E (ASI_ECACHE_FLUSH).
This ASI can be accessed only by a privileged instruction; a privileged_action trap is taken
if PSTATE.PRIV is not set. The L2-cache flush ASI format is illustrated in FIGURE 9-1 and
described in TABLE 9-1.








[Figure: bit-field layout of the L2-cache flush ASI address, with field boundaries at bits
63, 41, 40, 36, 35, 34, 33, 32, 31, 30, 18, and 17]

FIGURE 9-1 L2-Cache Flush ASI Format


TABLE 9-1 L2-Cache Flush ASI Format

Bits    Field         Description
63:43   -             Reserved. Set to 0.
42:41   -             Reserved. Set to 0. Makes sure that the victimizing read is treated
                      as a cacheable space.
40:36   -             Reserved.
35:34   -             Reserved.
33:32   EC_WAY        L2 way selection.
31      -             Reserved. Set to 1.
30:18   -             Reserved. Set to 0.
17:6    EC_TAG_ADDR   Index into the L2-cache.
5:0     -             Reserved. Set to 0.














A load using the L2-cache Flush ASI can be used to flush an L2-cache line with
EC_TAG_ADDR supplying the index and EC_WAY providing the required way.
The loads will not generate a miss in L2-cache if there is no dirty data in the associated
set/way. However, they will cause a miss if there is dirty data to be flushed (the W-cache data
will be merged with L2-cache data if needed). The returned data for this load miss will be
installed in an invalid state. A store to this ASI will execute like a NOP.
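The address for such a flush load can be assembled from the fields in TABLE 9-1. The following is a minimal sketch that follows the table as printed (EC_WAY in bits 33:32, bit 31 set to 1, EC_TAG_ADDR in bits 17:6, all other bits 0); the function name is hypothetical, and the sketch only builds the address, it does not issue the privileged load.

```python
def ecache_flush_va(way: int, index: int) -> int:
    """Build the virtual address for a load to ASI_ECACHE_FLUSH (ASI 0x4E)."""
    if not 0 <= way < 4:
        raise ValueError("EC_WAY is a 2-bit field")
    if not 0 <= index < 4096:
        raise ValueError("EC_TAG_ADDR is a 12-bit field")
    return (way << 32) | (1 << 31) | (index << 6)


def all_flush_addresses():
    """Yield one address per (way, index) pair, covering the whole L2-cache."""
    for way in range(4):
        for index in range(4096):
            yield ecache_flush_va(way, index)
```

Iterating over `all_flush_addresses()` gives 16384 addresses, one for each 64-byte line of the 1 MB cache.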


Clean (S or E) lines are invalidated immediately. There is no JBUS read. 


The VA<42:0> is used directly to create the PA<42:0> used for the read that goes out to 
JBUS (as an RDS). 


PA<33:0> is used for the DRAM at each memory controller. PA<33:32> is used for the Chip 
Select decode, and not all encodings may point to a DIMM in a system. Therefore, it is not 
possible to create an address that will definitely read from a DRAM. 





The read will receive AFSR.JETO if a nonexistent port is used in the address, causing a fatal 
error (system reset). 


The read will not receive AFSR.TO if the DRAM does not exist on a valid port. Flush 
completes normally. Unknown data is installed in the invalid state. 


It is possible to log UE/FRU/RUE or CE/FRC/RCE due to the DRAM read, if DRAM exists 
at the address created by hardware. (A read is done to create a displacement flush.) If this 
happens, the processor traps like a normal read that triggered these errors. 


In a multiprocessor system, the target address must point to your own ID, because as a 
destination, the UltraSPARC IIIi processor cannot tolerate having to return multiple read 
error packets to different masters around the same time (the system will hang). By pointing 
to your own ID, a JBUS read error packet is not used. However, note that the address does 
not need to point to valid DRAM. 


It is possible that the JBUS read address may actually be in another processor’s cache. The 
data will be correctly returned from that cache. Since a JBUS RDS is used, any write 
permission will be removed at that cache (M to O). If the line was E, it will be reduced to S 
state in other caches. It is possible that such a cache read could cause an L2-cache error to be 
logged by that other processor. 


Note — Since the I-cache, D-cache, and P-cache are non-inclusive, flushing the L2-cache
has no effect on them, and they may need to be flushed separately. The W-cache is inclusive,
and gets flushed with the L2-cache, if necessary.





9.3 
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Controlling P-Cache 





This section clarifies the use of DCUCR.PE, DCUCR.HPE, and DCUCR.SPE bits. 
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Note — Block loads do not cause installs into the P-cache. They are also not allowed to hit 


on the P-cache and, therefore, never trigger hardware prefetch.


Non-cacheable address space never installs in P-cache or L2-cache, unless a software 
prefetch is done specifically to the non-cacheable address (should be followed by a prefetch 
invalidate to that address, after using the data). 


TABLE 9-2 Explanation of P-cache control bits

DCUCR.PE  DCUCR.HPE  DCUCR.SPE  Hardware   Software   FP load miss (32B)  FP loads checked
                                Prefetch   Prefetch   installed in the    for P-Cache
                                Enabled?   Enabled?   P-Cache?            hit/miss?
0         X          X          no         no         no                  no
1         0          0          no         no         no                  yes























9.4 Translation Lookaside Buffers (TLBs)


The Instruction TLB consists of a 16-entry, fully-associative TLB that holds entries for
64 KB, 512 KB, and 4 MB pages and for all locked pages of any size, and a 128-entry, 2-way
associative TLB that is used for unlocked 8 KB pages.


The Data TLB consists of a 16-entry, fully-associative TLB that holds entries for unlocked
8 KB, 64 KB, 512 KB, and 4 MB pages and for all locked pages, and two 512-entry, 2-way
associative TLBs that are used for unlocked 8 KB, 64 KB, 512 KB, or 4 MB pages.


9.4.1 TLB Flushing


A demap-all operation that removes all unlocked TTEs has been added to both the I-TLBs
and D-TLBs.
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9.4.2 


9.4.3 


9.4.4 


9.4.5 
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TTE Format 


The UltraSPARC IIIi processor adds the following elements to the TTE format:


Physical Address field was expanded from 28 bits (PA<40:13>, TTE<40:13>) to 30 bits 
(PA<42:13>, TTE<42:13>) 











A snoop bit was added to mark a page as outside the coherence domain (TTE<47>) 
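The widened physical address field can be illustrated with a little bit arithmetic. The sketch below assumes an 8 KB page, so that the page offset is VA<12:0>, and it only extracts the new bit at TTE<47> without interpreting its polarity; the names are hypothetical.

```python
PA_FIELD_MASK = ((1 << 43) - 1) & ~((1 << 13) - 1)   # TTE<42:13>
PAGE_OFFSET_MASK = (1 << 13) - 1                     # VA<12:0> for an 8 KB page


def tte_to_pa(tte_data: int, va: int) -> int:
    """Combine PA<42:13> from the TTE with the page offset of `va`."""
    return (tte_data & PA_FIELD_MASK) | (va & PAGE_OFFSET_MASK)


def tte_bit47(tte_data: int) -> int:
    """Return the raw value of the bit added at TTE<47>."""
    return (tte_data >> 47) & 1
```

The mask covers 30 bits, matching the expanded field width, so the largest address the sketch can produce is 0x7FFFFFFFFFF.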


Synchronous Fault Status Register (SFSR) Extensions 


One status bit has been added to the I/D-TLB SFSRs: 


NF — Set to indicate the faulting operation was a speculative load instruction 


A new fault type was added to the FT field of the SFSR to indicate an I/D-TLB miss. 


I/D Translation Storage Buffer Register 


Three new register extensions of the I/D-TSB register have been added to the
UltraSPARC IIIi processor. These registers allow a different TSB virtual address base to be
used for each of the three virtual address spaces (Primary, Secondary, Nucleus) in the D-TLB
and two virtual address spaces (Primary, Nucleus) in the I-TLB. On an I/D-TLB miss, the
hardware selects which TSB Extension Register to use to form the TSB base address
based on the virtual space accessed by the faulting instruction.


TLB Data Access Register 


The access address for the TLB Data Access Register has been expanded to enable access to 
three TLBs each with up to 512 entries. 


Warning — Under some circumstances a diagnostic read from the fully associative TLBs
(ASI_DTLB_DATA_ACCESS_REG (ASI = 0x5D) and ASI_ITLB_DATA_ACCESS_REG
(ASI = 0x55)) will return wrong data. Software should read the fully associative TLB entry
twice, back-to-back. The second access will return correct data.
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9.4.5.1 Special Case for Data TLBs 


If a memory access instruction that misses the TLB is followed by a diagnostic read (LDXA
from ASI_DTLB_DATA_ACCESS_REG, that is, ASI = 0x5D) of the fully associative TLBs,
and the accessed TTE has its page size set to 64 KB, 512 KB, or 4 MB, then the data returned
from the TLB will be wrong.














9.4.5.2 Special Case for Instruction TLBs 


If an instruction that misses the instruction TLB is followed by a diagnostic read (LDXA
from ASI_ITLB_DATA_ACCESS_REG, that is, ASI = 0x55) of the fully associative TLBs,
and the accessed TTE has its page size set to 64 KB, 512 KB, or 4 MB, then the data returned
from the TLB will be wrong.














9.4.6 TLB Diagnostic Register 


This is a new register to replace the function of the diagnostic bits in the TTE. 
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CHAPTER 1 0 





Interrupt Handling 





Processors and I/O devices can interrupt a selected processor by assembling and sending an 
interrupt packet consisting of eight 64-bit words of interrupt vector data. The contents of 
these data are defined by software convention. Thus, hardware interrupts and cross-calls can 
have the same hardware mechanism for interrupt delivery and can share a common software 
interface for processing. 


The interrupt requesting/receiving mechanism is a two-step process: the sending of an 
interrupt request on a vector data register to the target and the scheduling of the received 
interrupt request on the target upon receipt. 


An interrupt request packet is sent by processors or I/O devices through the interrupt vector 
dispatch mechanism and is received by the specified target through the interrupt vector 
receive mechanism. Upon receipt of an interrupt request packet, a special trap is invoked on 
the target processor. The trap handler software invoked in the target processor then schedules 
the interrupt request to itself by posting the interrupt into SOFTINT register at the desired 
interrupt level. 


Note that the processor may not send an interrupt request packet to itself through the 
interrupt dispatch mechanism. Separate sets of dispatch (outgoing) and receive (incoming) 
interrupt data registers allow simultaneous interrupt dispatching and receiving. 

Different aspects of interrupt handling are described in the following sections: 

- Interrupt Vector Dispatch 

- Interrupt Vector Receive 

- Interrupt Global Registers 

- Interrupt ASI Registers 

- Software Interrupt Register (SOFTINT) 
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10.1 Interrupt Vector Dispatch 


To dispatch an interrupt or cross-call, a processor or I/O device first writes to the outgoing 
Interrupt Vector Data Registers according to an established software convention, described 
below. A subsequent write to the Interrupt Vector Dispatch Register triggers the interrupt 
delivery. The status of the interrupt dispatch can be read by polling the
ASI_INTR_DISPATCH_STATUS BUSY and NACK bits. A MEMBAR #Sync should be used
before polling begins to ensure that earlier stores are completed. CODE EXAMPLE 10-1 shows
the pseudo-code sequence that sends an interrupt.





BUSY and NACK bits of the Interrupt Vector Dispatch Status Register, listed in TABLE 10-1, 
indicate the status of the interrupt dispatched. 


TABLE 10-1 BUSY and NACK Bits of Interrupt Vector Dispatch Register

BUSY  NACK  Status
1     0     Interrupt dispatch pending

The ASI_INTR_DISPATCH_STATUS register contains four BUSY/NACK bit pairs, enabling
interrupts to be pipelined. Specifying a unique BUSY/NACK pair for each interrupt when
writing the Interrupt Dispatch Register enables up to four interrupts to be outstanding at
one time.














Note — The processor may not send an interrupt vector to itself through outgoing interrupt 
vector data registers. Doing so causes undefined interrupt vector data to be returned. 


CODE EXAMPLE 10-1 Code Sequence for Interrupt Dispatch

Read state of ASI_INTR_DISPATCH_STATUS; Error if BUSY
    <no pending interrupt dispatch packet>
Repeat
    Begin atomic sequence (PSTATE.IE ← 0)
    Store to IV data reg 0 at ASI_INTR_W, VA=0x40 (optional)
    Store to IV data reg 1 at ASI_INTR_W, VA=0x48 (optional)
    Store to IV data reg 2 at ASI_INTR_W, VA=0x50 (optional)
    Store to IV data reg 3 at ASI_INTR_W, VA=0x58 (optional)
    Store to IV data reg 4 at ASI_INTR_W, VA=0x60 (optional)
    Store to IV data reg 5 at ASI_INTR_W, VA=0x68 (optional)
    Store to IV data reg 6 at ASI_INTR_W, VA=0x80 (optional)
    Store to IV data reg 7 at ASI_INTR_W, VA=0x88 (optional)
    Store to IV dispatch at ASI_INTR_W, VA<63:29>=0,
        VA<28:24>=BUSY/NACK bit #, VA<23:14>=ITID,
        VA<13:0>=0x70 initiates interrupt delivery
    Membar #Sync (wait for stores to finish)
    Poll state of ASI_INTR_DISPATCH_STATUS (BUSY, NACK)
    Loop if BUSY
    End atomic sequence (PSTATE.IE ← 1)
    DONE if !NACK
    (Retry after random delay if NACKED)
Until DONE





Note — To avoid deadlocks, enable interrupts for some period before retrying the atomic 


sequence. Alternatively, implement the atomic sequence with locks without disabling 


interrupts. 
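The virtual address used for the dispatch store in the code sequence packs three fields. The sketch below builds that address using the field positions given in CODE EXAMPLE 10-1 (VA<28:24> for the BUSY/NACK pair number, VA<23:14> for ITID, VA<13:0> = 0x70); the function name is hypothetical, and the field widths are taken from that sequence only.

```python
def intr_dispatch_va(bn: int, itid: int) -> int:
    """Build the VA for the store to the Interrupt Vector Dispatch Register."""
    if not 0 <= bn < (1 << 5):
        raise ValueError("BUSY/NACK pair number must fit in VA<28:24>")
    if not 0 <= itid < (1 << 10):
        raise ValueError("ITID must fit in VA<23:14>")
    return (bn << 24) | (itid << 14) | 0x70
```

For example, BUSY/NACK pair 1 with target ID 2 gives the address 0x1008070.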








10.2 


Interrupt Vector Receive 


When an interrupt is received, all eight Interrupt Data Registers are updated, regardless of
which are being used by software. This update is done in conjunction with the setting of the
BUSY bit in the ASI_INTR_RECEIVE register. At this point, the processor inhibits further
interrupt packets from the system bus. If interrupts are enabled (PSTATE.IE = 1), then an
interrupt trap (trap type 0x60) is generated. Software reads the ASI_INTR_RECEIVE
register and Incoming Interrupt Data Registers to determine the entry point of the appropriate
trap handler. All of the external interrupt packets are processed at the highest interrupt
priority level and are then reprioritized as lower-priority interrupts in the software handler.


CODE EXAMPLE 10-2 illustrates interrupt receive handling. 


CODE EXAMPLE 10-2 Code Sequence for an Interrupt Receive

Read state of ASI_INTR_RECEIVE; Error if !BUSY
Read from IV data reg 0 at ASI_SDB_INTR_R, VA=0x40 (optional)
Read from IV data reg 1 at ASI_SDB_INTR_R, VA=0x48 (optional)
Read from IV data reg 2 at ASI_SDB_INTR_R, VA=0x50 (optional)
Read from IV data reg 3 at ASI_SDB_INTR_R, VA=0x58 (optional)
Read from IV data reg 4 at ASI_SDB_INTR_R, VA=0x60 (optional)
Read from IV data reg 5 at ASI_SDB_INTR_R, VA=0x68 (optional)
Read from IV data reg 6 at ASI_SDB_INTR_R, VA=0x80 (optional)
Read from IV data reg 7 at ASI_SDB_INTR_R, VA=0x88 (optional)
Determine the appropriate handler
Handle interrupt or reprioritize this trap and
    set the SOFTINT register
Store zero to ASI_INTR_RECEIVE to clear the BUSY bit




















10.3 Interrupt Global Registers 


A separate set of global registers is implemented to expedite interrupt processing. As 
described in Section 10.2, “Interrupt Vector Receive”, the processor takes an interrupt trap 
after receiving an interrupt packet. Software uses a number of scratch registers while 
determining the appropriate handler and constructing the interrupt state. 


A separate set of eight Interrupt Global Registers (IGRs) replaces the eight 
programmer-visible global registers during interrupt processing. After an interrupt trap is 
dispatched, the hardware selects the interrupt global registers by setting the PSTATE.IG 
field. The previous value of PSTATE is restored from the trap stack by a DONE or RETRY 
instruction on exit from the interrupt handler. 























10.4 Interrupt ASI Registers 


A MEMBAR #Sync is generally needed after stores to interrupt ASI registers. It avoids
unwanted effects caused by possible prefetches to locations that have side effects.





10.4.1 Outgoing Interrupt Vector Data<7:0> Register 









































ASI_INTR_DATA0_W (data 0): ASI = 0x77, VA<63:0> = 0x40
ASI_INTR_DATA1_W (data 1): ASI = 0x77, VA<63:0> = 0x48
ASI_INTR_DATA2_W (data 2): ASI = 0x77, VA<63:0> = 0x50
ASI_INTR_DATA3_W (data 3): ASI = 0x77, VA<63:0> = 0x58
ASI_INTR_DATA4_W (data 4): ASI = 0x77, VA<63:0> = 0x60
ASI_INTR_DATA5_W (data 5): ASI = 0x77, VA<63:0> = 0x68
ASI_INTR_DATA6_W (data 6): ASI = 0x77, VA<63:0> = 0x80
ASI_INTR_DATA7_W (data 7): ASI = 0x77, VA<63:0> = 0x88

Name: ASI_INTR_DATA_W: Outgoing Interrupt Vector Data Registers (Privileged, Write-only)

TABLE 10-2 describes the register field of the eight Outgoing Interrupt Vector Data Registers.

TABLE 10-2 Outgoing Interrupt Vector Data Register Format

Bits   Field   Type   Description
63:0   Data    W      Interrupt data

A write to these eight registers modifies the outgoing Interrupt Dispatch Data Registers.

Non-privileged access to this register causes a privileged_action trap. An attempt to read this
register causes a data_access_exception trap.

10.4.2 Interrupt Vector Dispatch Register

ASI 0x77
VA<63:19> = 0
VA<18:14> = Target Processor ID
VA<13:0> = 0x70

Name: ASI_INTR_W (Interrupt dispatch, Privileged, Write-only)
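
As an illustration of the address layout above, the following Python sketch composes the virtual
address for a dispatch store and derives the BUSY/NACK pair that a target ID maps to (VA<15:14>,
as described in TABLE 10-3). The function names are hypothetical. On hardware the resulting
address would be used with a privileged store to ASI 0x77, which is not modeled here.

```python
def dispatch_va(target_id):
    """Return the VA for a dispatch store: VA<18:14> = target, VA<13:0> = 0x70."""
    if not 0 <= target_id <= 0x1F:
        raise ValueError("target ID must fit in the 5-bit field VA<18:14>")
    return (target_id << 14) | 0x70


def busy_nack_pair(target_id):
    """Return the BUSY/NACK pair number selected by VA<15:14>."""
    return target_id & 0x3


def status_bits(target_id):
    """Return the (BUSY, NACK) bit positions in the dispatch status register."""
    pair = busy_nack_pair(target_id)
    return 2 * pair, 2 * pair + 1
```

Because only the low two bits of the target ID select the pair, targets 1 and 5 in this sketch
share pair 1. This is the aliasing that software must manage in systems with more than four
processors.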


TABLE 10-3 describes the fields of the Interrupt Vector Dispatch Register. 


TABLE 10-3 Interrupt Vector Dispatch Register Format 





Bits        Field   Type   Description

VA<18:14>   ITID    W      Interrupt Target ID. Specifies the interrupt target processor using the
                           BUSY/NACK bit pair BN, along with the contents of the eight Interrupt
                           Vector Data Registers. VA<15:14> specifies which of the BUSY/NACK bit
                           pairs to use for the interrupt (the lower two bits of Agent/Target ID
                           are direct mapped to BN#).
                           0x0 in this field selects BUSY/NACK bits
                           ASI_INTR_DISPATCH_STATUS<1:0>.
                           0x1 in this field selects BUSY/NACK bits
                           ASI_INTR_DISPATCH_STATUS<3:2>.
                           0x2 in this field selects BUSY/NACK bits
                           ASI_INTR_DISPATCH_STATUS<5:4>.
                           0x3 in this field selects BUSY/NACK bits
                           ASI_INTR_DISPATCH_STATUS<7:6>.

If there are more than four processors in the system, software must take care of
aliasing caused by direct mapping of the lower two bits of Agent IDs.

A write to this ASI triggers an interrupt vector dispatch to the target processor identified by
the Interrupt Target ID (ITID), using BUSY/NACK bit pair BN along with the contents of the
eight Interrupt Vector Data Registers. The write acts only as a trigger; the data written is
ignored.

A read from the Interrupt Vector Dispatch Register causes a data_access_exception trap.
Non-privileged access to this register causes a privileged_action trap.

10.4.3 Interrupt Vector Dispatch Status Register

ASI 0x48
VA<63:0> = 0
Name: ASI_INTR_DISPATCH_STATUS (Privileged, Read-only)


TABLE 10-4 describes the fields of the Interrupt Vector Dispatch Status Register. 


TABLE 10-4 Interrupt Dispatch Status Register Format 





























Bits      Field   Type   Description

<63:8>    --      --     Reserved, read as 0.

1,3,5,7   NACK    R      Set if interrupt dispatch has failed. Cleared at the start of every
                         interrupt dispatch attempt; set when a dispatch has failed.

0,2,4,6   BUSY    R      Set when there is an outstanding dispatch.

In the UltraSPARC IIIi processor, four BUSY/NACK pairs are implemented in the Interrupt
Vector Dispatch Status Register.

The status of up to four outgoing interrupts can be read from the
ASI_INTR_DISPATCH_STATUS BUSY/NACK bits. This register contains up to four
BUSY/NACK bit pairs: the pairs at <1:0>, <3:2>, <5:4>, and <7:6> are referred to as pair 0,
pair 1, pair 2, and pair 3, respectively.

The VA<15:14> field of the Interrupt Dispatch Register specifies which BUSY/NACK bit pair
will be used for the interrupt.
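
The pair layout can be sketched as a small decoder. The following Python function is a
hypothetical helper, not part of any Sun software: it takes a value as read from
ASI_INTR_DISPATCH_STATUS and returns the BUSY and NACK state of pairs 0 through 3.

```python
def decode_dispatch_status(value):
    """Return a list of (busy, nack) tuples for BUSY/NACK pairs 0..3."""
    pairs = []
    for pair in range(4):
        busy = bool((value >> (2 * pair)) & 1)        # even bit: BUSY
        nack = bool((value >> (2 * pair + 1)) & 1)    # odd bit: NACK
        pairs.append((busy, nack))
    return pairs


def dispatch_outcome(value, pair):
    """Classify one pair as 'pending', 'failed', or 'done'."""
    busy, nack = decode_dispatch_status(value)[pair]
    if busy:
        return "pending"
    return "failed" if nack else "done"
```

A dispatching routine would typically poll until its pair is no longer pending and then retry if
the outcome is failed. That retry policy is an assumption of this sketch rather than a
requirement stated in this section.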
Writes to this ASI cause a data_access_exception trap. Non-privileged access to this register
causes a privileged_action trap.

10.4.4 Incoming Interrupt Vector Data<7:0>

ASI_INTR_DATA0_R (data 0): ASI = 0x7F, VA<63:0> = 0x40
ASI_INTR_DATA1_R (data 1): ASI = 0x7F, VA<63:0> = 0x48
ASI_INTR_DATA2_R (data 2): ASI = 0x7F, VA<63:0> = 0x50
ASI_INTR_DATA3_R (data 3): ASI = 0x7F, VA<63:0> = 0x58
ASI_INTR_DATA4_R (data 4): ASI = 0x7F, VA<63:0> = 0x60
ASI_INTR_DATA5_R (data 5): ASI = 0x7F, VA<63:0> = 0x68
ASI_INTR_DATA6_R (data 6): ASI = 0x7F, VA<63:0> = 0x80
ASI_INTR_DATA7_R (data 7): ASI = 0x7F, VA<63:0> = 0x88

Name: ASI_INTR_R (Privileged, Read-only)

TABLE 10-5 describes the register field of the eight Incoming Interrupt Vector Data Registers.

TABLE 10-5 Incoming Interrupt Vector Data Register Format

Bits   Field   Type   Description
63:0   Data    R      Interrupt data

A read from these registers returns incoming interrupt information from the incoming
Interrupt Receive Data Registers.

Non-privileged access to this register causes a privileged_action trap.

10.4.5 Interrupt Vector Receive Register

ASI 0x49
VA<63:0> = 0
Name: ASI_INTR_RECEIVE (Privileged)














TABLE 10-6 describes the fields of the Interrupt Receive Register. 


TABLE 10-6 Interrupt Receive Register Format 














Bits   Field    Type   Description

63:6   --       R      Reserved. Read as 0.

5      BUSY     RW     Set when an interrupt vector is received. The BUSY bit must be cleared by
                       software writing zero.

4:0    SOURCE   R      Source ID of interrupter. Accurate when BUSY is set. Source ID is the AID
                       field of the interrupting agent.

The status of an incoming interrupt can be read from ASI_INTR_RECEIVE. The BUSY bit
is cleared by writing zero to this register. The BUSY bit is also cleared during Power-on Reset.

Non-privileged access to the Interrupt Vector Receive Register causes a privileged_action
trap.





10.5 Software Interrupt Register (SOFTINT)

To schedule interrupt vectors for processing at a later time, each processor can send itself
signals by setting bits in the SOFTINT register.

The SOFTINT register (ASR 0x16), described in TABLE 10-7, is used for communication from
nucleus (TL > 0) code to kernel (TL = 0) code. Interrupt packets and other service requests
can be scheduled in queues or mailboxes in memory by the nucleus, which then sets
SOFTINT<n> to cause an interrupt at level <n>.


TABLE 10-7 SOFTINT Register Format

Bits     Field           RW   Description

<16>     STICK_INT       RW   System Timer interrupt.
                              When the STICK_CMPR INT_DIS field is cleared (that is, STICK
                              interrupt is enabled) and the 63-bit STICK Compare Register's
                              STICK_CMPR field matches the STICK Register's counter field, the
                              STICK_INT field is set and a software interrupt is generated.

<15:1>   SOFTINT<15:1>   RW   When set, bits <15:1> cause interrupts with each bit
                              corresponding to levels IRL<15:1>, respectively.

<0>      TICK_INT        RW   Timer interrupt.
                              When TICK_CMPR's INT_DIS field is cleared (that is, TICK interrupt
                              is enabled) and the 63-bit TICK Compare Register's TICK_CMPR field
                              matches the TICK Register's counter field, the TICK_INT field is
                              set and a software interrupt is generated.

Non-privileged access to this register causes a privileged_opcode trap.





10.5.1 Setting the Software Interrupt Register 


Setting SOFTINT<n> is done by a write to the SET_SOFTINT register (ASR 0x14), with
bit n corresponding to the interrupt level set. The value written to the SET_SOFTINT
register is effectively ORed into the SOFTINT register. This approach allows the interrupt
handler to set one or more bits in the SOFTINT register with a single instruction.











Read accesses to the SET_SOFTINT register cause an illegal_instruction trap. Non-privileged
accesses to this register cause a privileged_opcode trap.














When the nucleus returns, if (PSTATE.IE = 1) and (n > PIL), then the processor will 
receive the highest-priority interrupt IRL<n> of the asserted bits in SOFTINT<16:0>. The 
processor then takes a trap for the interrupt request, and the nucleus sets the return state to 
the interrupt handler at that PIL and returns to TL = 0. In this manner, the nucleus can 
schedule services at various priorities and process them according to their priority. 


10.5.2 Clearing the Software Interrupt Register 


When all interrupts scheduled for service at level n have been serviced, the kernel writes to
the CLEAR_SOFTINT register (ASR 0x15) with bit n set, to clear that interrupt. The
complement of the value written to the CLEAR_SOFTINT register is effectively ANDed
with the SOFTINT register. This approach allows the interrupt handler to clear one or more
bits in the SOFTINT register with a single instruction.

Read accesses to the CLEAR_SOFTINT register cause an illegal_instruction trap.
Non-privileged write accesses to this register cause a privileged_opcode trap.

The timer interrupt TICK_INT and system timer interrupt STICK_INT are equivalent to
SOFTINT<14> and have the same effect.


Note — To avoid a race condition between the kernel clearing an interrupt and the nucleus 
setting it, the kernel should examine the queue for any valid entries again after clearing the 
interrupt bit. 
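
The set and clear semantics, and the re-examination recommended in the note above, can be
sketched in Python. SoftintModel and service_level are hypothetical names used only for
illustration. The model is single-threaded, so it shows the order of operations rather than a
real race between nucleus and kernel code.

```python
SOFTINT_MASK = (1 << 17) - 1     # SOFTINT implements bits <16:0>


class SoftintModel:
    """Hypothetical model of SOFTINT with its SET and CLEAR aliases."""

    def __init__(self):
        self.softint = 0

    def set_softint(self, value):
        # A write to SET_SOFTINT is ORed into SOFTINT.
        self.softint |= value & SOFTINT_MASK

    def clear_softint(self, value):
        # The complement of a write to CLEAR_SOFTINT is ANDed with SOFTINT.
        self.softint &= ~value & SOFTINT_MASK


def service_level(model, level, queue, handle):
    """Drain the queue for one level, clear its bit, then check the queue again."""
    while True:
        while queue:
            handle(queue.pop(0))
        model.clear_softint(1 << level)
        if not queue:            # re-examine after clearing the interrupt bit
            return
```

If an entry were queued between the last pop and the clear, the final check would send the loop
around again instead of leaving the entry unserviced with its interrupt bit already cleared.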


TABLE 10-8 summarizes the SOFTINT ASRs. 


TABLE 10-8 SOFTINT ASRs 





ASR Value   ASR Name        Type   Description

0x14        SET_SOFTINT     W      Sets bit(s) in Soft Interrupt Register.

0x15        CLEAR_SOFTINT   W      Clears bit(s) in Soft Interrupt Register.

0x16        SOFTINT         RW     Per-processor Soft Interrupt Register.

CHAPTER 11





Performance Instrumentation 





Performance instrumentation consists of processor event counters that can be used to gather
statistics during program execution. Approximately 70 events can be monitored, two at a
time, to gain information about the performance of the processor. Cache miss counts and stall
times, for example, can be measured using two 32-bit Performance Instrumentation Counters
(PICs). Some event counts can be synthesized from the available event counters to provide
additional program execution statistics.


The counters can be monitored during program execution to gather ongoing statistics, or
reconfigured during steady-state program execution to gather statistics for more than two
events.


The Performance Control Register (PCR) is used to select the events to monitor and provide 
control for counting in privileged and/or non-privileged modes. 


Each of the two 32-bit performance instrumentation counters (PICs), PICL and PICU, can
accumulate over four billion events before wrapping. Event counts can be extended
by periodically reading the contents of the performance instrumentation counters to detect and
avoid an overflow. An interrupt can be enabled on a counter overflow. Additional event or
stall cycle statistics can be collected by reading the PIC counts between repeated program
executions.

This chapter describes the performance instrumentation features in the following sections: 

- Section 11.1, “Performance Control Register (PCR)”

- Section 11.2, “Performance Instrumentation Counter (PIC) Register”

- Section 11.3, “Performance Instrumentation Operation”

- Section 11.4, “Pipeline Counters”

- Section 11.5, “Cache Access Counters”

- Section 11.6, “Memory Controller Counters”

- Section 11.7, “Miscellaneous Counters”

- Section 11.8, “PCR.SL and PCR.SU Encodings”


Supervisor/User Mode


Access to the PCR is restricted to supervisor software. User software accessing the PCR
causes a privileged_opcode trap.


Supervisor software controls user accessibility to the PIC counters through the PCR.PRIV
field. When PCR.PRIV = 1 (supervisor access only), an attempt by user software to access
the PIC register causes a privileged_action trap. By default, PCR.PRIV = 0. In this default
state, the PIC register is accessible to user software.


In Supervisor/User configuration, the mode in which the counters are enabled to count is
controlled by setting the PCR.UT (User Trace) and PCR.ST (System Trace) bits.


11.1 Performance Control Register (PCR)


The 64-bit PCR and PIC are accessed through read/write Ancillary State Register (ASR)
instructions (RDASR/WRASR). PCR and PIC are located at ASRs 16 (0x10) and 17 (0x11),
respectively.


Two events can simultaneously be measured by setting the PIC_SL and PIC_SU fields. The
counters can be enabled separately for Supervisor and User mode using the UT and ST fields.
The selected statistics are reflected during subsequent accesses to the PICs.


The PCR is a read/write register used to control the counting of performance monitoring
events. FIGURE 11-1 shows the details of the PCR and TABLE 11-1 describes the various fields
of the PCR. Counts are collected in the PIC register (see Section 11.2, “Performance
Instrumentation Counter (PIC) Register”).


PCR - Performance Control Register (ASR Register)

The PCR selects the events and controls the operating modes of the Performance Instrumentation
Counters (PICs).

ASR 16 (0x10), 64-bit, Read/Write
Access: Privileged Mode, otherwise privileged_action trap.
Reset: 0x0000.0000

[Bit-field diagram of the 64-bit PCR, bits 63:0. It marks architecturally reserved and
implementation-reserved fields and the UT (user trace), ST (supervisor trace), and
PRIV (privileged) bits.]

FIGURE 11-1 Performance Control Register
TABLE 11-1 PCR Bit Description 





Field   Description

SU      Selects 1 of up to 64 counters accessible in the upper half (bits <63:32>) of
        the PIC register.

SL      Selects 1 of up to 64 counters accessible in the lower half (bits <31:0>) of
        the PIC register.

UT      User Trace Enable.
        If set to one, counts events in non-privileged mode (User).

ST      System Trace Enable.
        If set to one, counts events in privileged mode (Supervisor).
        Notes:
        If both PCR.UT and PCR.ST are set to one, all selected events are counted.
        If both PCR.UT and PCR.ST are zero, counting is disabled.
        PCR.UT and PCR.ST are global fields which apply to both PIC pairs.

PRIV    Privileged. If PCR.PRIV = 1, a non-privileged (PSTATE.PRIV = 0)
        attempt to access PIC (via a RDPIC or WRPIC instruction) will result in a
        privileged_action exception.

--      Reserved by SPARC architecture.
        Read zero, Write zero, or Write value read previously.

--      Unused in the UltraSPARC IIIi processor.
        Read zero, Write zero, or Write value read previously.


11.2 Performance Instrumentation Counter (PIC) Register


The difference between the values read from the PIC on two reads reflects the number of 
events that occurred between register reads. Software can only rely on read-to-read PIC 
accesses to get an accurate count and not a write-to-read of the PIC counters. Every time the 
select values (PCR.SU or PCR.SL) are changed, the PIC register is reset and starts counting 
from zero. If there is a context switch, it is the responsibility of software to save the previous 
PCR and PIC values. FIGURE 11-2 shows the details of the PIC and TABLE 11-2 describes the 
various fields of the PIC. 
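
The read-to-read rule can be sketched as follows. The helper names are hypothetical, and the
sketch assumes that each 32-bit counter wrapped at most once between the two reads and that
PCR.SU and PCR.SL were not changed in between (a change would reset the counts).

```python
MASK32 = 0xFFFF_FFFF


def split_pic(pic):
    """Split a 64-bit PIC value into (PICU, PICL)."""
    return (pic >> 32) & MASK32, pic & MASK32


def pic_delta(before, after):
    """Return (PICU events, PICL events) between two reads of the PIC.

    Subtraction is done modulo 2**32, so a single wrap of either
    counter between the reads still yields the correct difference.
    """
    before_u, before_l = split_pic(before)
    after_u, after_l = split_pic(after)
    return (after_u - before_u) & MASK32, (after_l - before_l) & MASK32
```

For example, if PICL reads 0xFFFFFFF0 and later 0x00000010, the modular difference is 0x20
events, even though the counter wrapped between the reads.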


PIC - Performance Instrumentation Counter Register (ASR Register)

The PIC register provides access to the counter values for the two events being monitored.

64-bit, Read/Write
Note: Writes are designed for diagnostic and test purposes.
Accessibility depends on the PCR.PRIV bit:
    0 = accessible in any mode
    1 = accessible in Supervisor Mode, otherwise privileged_action trap
Reset: 0x0000.0000

Bits 63:32   PICU
Bits 31:0    PICL

FIGURE 11-2 Performance Instrumentation Counter Register


TABLE 11-2 PIC Register Fields 








Bit     Field   Description

63:32   PICU    32-bit field representing the count of an event selected by the SU field of
                the Performance Control Register (PCR)

31:0    PICL    32-bit field representing the count of an event selected by the SL field of
                the Performance Control Register (PCR)


11.2.1 PIC Counter Overflow Trap Operation


When a PIC counter overflows, an interrupt is generated as described in TABLE 11-3. 


TABLE 11-3 PIC Counter Overflow Processor Compatibility Comparison 


Function               Description

PIC Counter Overflow   On overflow, a counter wraps to zero, SOFTINT register bit 15 is set to
                       one, and an interrupt level 15 trap (a disrupting trap) is generated. The
                       counter overflow trap is triggered on the transition from value
                       0xFFFF.FFFF to value 0.
                       The point at which the interrupt is delivered may be several instructions
                       after the instruction responsible for the overflow event. This situation
                       is known as a “skid.”


11.3 Performance Instrumentation Operation


FIGURE 11-3 shows how an operating system might use the performance instrumentation features to
provide event monitoring services. Set up the PCR as desired to select two events and the
modes in which data should be collected. Monitoring must account for real system effects,
including system calls and interrupts. When used, the PCR is considered part of the process
state and must be saved and restored when switching process contexts.


Data can be collected multiple times while the program executes to show ongoing
statistics.


11.3.1 Gathering Data for More Than Two Events 


When more than two events need to be monitored, the program, program sequence, or
program loop needs to be run again with the new events enabled. It is not possible to monitor
more than two events at any given time. 
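
One way to organize repeated runs is to pair one PICL event with one PICU event per run. The
sketch below is only an illustration: plan_runs is a hypothetical helper, the event names are
taken from the counter tables in this chapter, and using Cycle_cnt as a filler relies on
TABLE 11-4 listing that event for both PICL and PICU.

```python
from itertools import zip_longest


def plan_runs(picl_events, picu_events, filler="Cycle_cnt"):
    """Pair PICL and PICU events into runs of two events each.

    When one list is longer than the other, the spare counter in
    the remaining runs is given the filler event.
    """
    return [
        (low or filler, high or filler)
        for low, high in zip_longest(picl_events, picu_events)
    ]
```

For example, plan_runs(["DC_rd", "IC_ref", "EC_ref"], ["DC_rd_miss", "IC_miss"]) produces three
runs, the last of which pairs EC_ref with Cycle_cnt.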


11.3.2 Gathering Data in Privileged and Non-Privileged 
Modes 


The PCR has mode bits to enable the counters in privileged mode, non-privileged mode, or to
count when in either mode. The mode setting affects both counters.

FOR ILLUSTRATIVE PURPOSES ONLY

Start (context A)
    Set up PCR:
        hi select value  -> PCR.SU
        low select value -> PCR.SL
        [0,1] -> PCR.UT
        [0,1] -> PCR.ST
        [0,1] -> PCR.PRIV
    0 -> PIC
    PIC -> r[rd]
    Accumulate statistics in PIC
    PIC -> r[rd]
Context switch to B
    PCR -> [savePCR]
    PIC -> [savePIC]
    Switch to context B
Context switch back to A
    [savePCR] -> PCR
    [savePIC] -> PIC
    PIC -> r[rd]

FIGURE 11-3 Operational Flow Diagram for Controlling Event Counters


11.3.3 Performance Instrumentation Implementations


Counting events and stall cycles is sometimes complex because of dynamic conditions
and cancelled activities.


11.3.4 Performance Instrumentation Accuracy


The performance instrumentation counters are designed to provide reasonable accuracy 
especially when used to count hundreds or thousands of events or stall cycles or when 
comparing the PIC counts that have recorded a similar number of events or stall cycles. 
Accuracy is most challenging when trying to associate an event to an instruction and when 
comparing PIC counts with one count rarely occurring. 


When using the overflow trap, it is sometimes difficult to pinpoint the instruction that is 
responsible for the overflow because of the way the pipeline is designed. A delay of several 
instructions is possible before the overflow is able to stop the current instruction flow and 
fetch the trap vector. This delay is referred to as skid and can occur for dozens of clock 
cycles. The skid for the load miss detection case is small. The skid value cannot be measured 
and its length depends on what event or stall cycle is being measured and what other 
instructions are in the pipeline. 





11.4 Pipeline Counters


11.4.1 Instruction Execution and Processor Clock Counts


TABLE 11-4 describes the counters for clock cycles and instruction execution.


TABLE 11-4 Instruction Execution Clock Cycles and Counts 





Counter     Description

Cycle_cnt   [PICL 00.0000 and PICU 00.0000]
            Counts clock cycles. This counter increments the same as the SPARC-V9
            TICK register, except that cycle counting is controlled by the PCR.UT and
            PCR.ST fields.

Instr_cnt   [PICL 00.0001 and PICU 00.0001]
            Counts the number of instructions completed. Annulled, mispredicted, or
            trapped instructions are not counted.


Synthesized Clocks Per Instruction (CPI)


The cycle and instruction counts can be used to calculate the average number of clock cycles
per completed instruction: Clock cycles per instruction, CPI = Cycle_cnt / Instr_cnt.
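
As a brief worked sketch, the calculation can be written in Python. The function names are
hypothetical, and the counts are assumed to be read-to-read differences taken over the same
interval, with Cycle_cnt on one counter and Instr_cnt on the other.

```python
def cpi(cycle_cnt, instr_cnt):
    """Average clock cycles per completed instruction."""
    if instr_cnt == 0:
        raise ValueError("no instructions completed in the interval")
    return cycle_cnt / instr_cnt


def ipc(cycle_cnt, instr_cnt):
    """Average completed instructions per clock cycle (reciprocal of CPI)."""
    if cycle_cnt == 0:
        raise ValueError("no cycles counted in the interval")
    return instr_cnt / cycle_cnt
```

With 3000 cycles and 2000 completed instructions, cpi returns 1.5.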


11.4.2 IIU Event Counts 


The counters listed in TABLE 11-5 record branch prediction event counts for taken and 
untaken branches in the Instruction Issue Unit (IIU). A retired branch in the following 
descriptions refers to a branch that reaches the D-stage without being invalidated. 


TABLE 11-5 Counters for Collecting IIU Statistics 





Counter                    Description

IU_Stat_Br_miss_taken      [PICL 01.0101] Counts retired branches that were
                           predicted to be taken, but in fact were not taken.

IU_Stat_Br_miss_untaken    [PICU 01.1101] Counts retired branches that were
                           predicted to be untaken, but in fact were taken.

IU_Stat_Br_Count_taken     [PICL 01.0110] Counts retired taken branches.

IU_Stat_Br_Count_untaken   [PICU 01.1110] Counts retired untaken branches.
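
One statistic that could be synthesized from these four counters is an overall branch
misprediction rate. This manual does not define such a metric; the formula and the function
name below are assumptions made for illustration. Because the two miss counters and the two
count counters share PICL and PICU assignments, the four values would come from separate runs
of the same workload.

```python
def branch_mispredict_rate(miss_taken, miss_untaken, count_taken, count_untaken):
    """Fraction of retired branches whose direction was mispredicted.

    miss_taken / miss_untaken:   IU_Stat_Br_miss_taken / _untaken counts
    count_taken / count_untaken: IU_Stat_Br_Count_taken / _untaken counts
    """
    total = count_taken + count_untaken
    if total == 0:
        raise ValueError("no retired branches counted")
    return (miss_taken + miss_untaken) / total
```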





11.4.3 IIU Dispatch Stall Counts 


The IIU stall counters, listed in TABLE 11-6, measure the major causes of pipeline stalls
(bubbles) from the instruction fetch and decode pipeline. Stalls are counted for each clock
cycle at which the associated condition is true.


FIGURE 11-4 illustrates the first two considerations described in Section 11.4.3.1. 


11.4.3.1 Dispatch Counter Considerations 


1. Dispatch Counters count when the buffer is empty, regardless of whether the execution
pipeline can accept more instructions from the instruction queue.


2. It is difficult to associate an empty queue with a single cause. Various reasons, taken
together or separately, can cause the instruction queue to be empty. The hardware picks the
most recent disruptive event in the Fetch Unit to choose the counter to which the empty
queue cycles are assigned.


3. Count accuracy is also subject to the conditions described for all counters in
Section 11.3.4, “Performance Instrumentation Accuracy.”


[Block diagram: Fetch Unit -> Instruction Queue -> Execution Pipeline. Stall cycles due to
incoming delays are determined from the dispatch counters that count clock cycles when the
queue is empty (empty cycles).]

FIGURE 11-4 Dispatch Counters


TABLE 11-6 Counters for IIU Stalls 





Counter               Description (see note 1)

Dispatch0_IC_miss     [PICL 00.0010] Counts the stall cycles due to the event that no
                      instructions are issued because I-queue is empty from instruction cache
                      miss. This count includes L2-cache miss processing if a L2-cache miss
                      also occurs.

Dispatch0_mispred     [PICU 00.0010] Counts the stall cycles due to the event that no
                      instructions are issued because I-queue is empty due to branch
                      misprediction.

Dispatch0_br_target   [PICL 00.0011] Counts the stall cycles due to the event that no
                      instructions are issued because I-queue is empty due to a branch target
                      address calculation.

Dispatch0_2nd_br      [PICL 00.0100] Counts the stall cycles due to the event of having
                      two branch instructions line up in one 4-instruction group, causing the
                      second branch in the group to be refetched, delaying its entrance into
                      the I-queue.

Dispatch_rs_mispred   [PICL 01.0111] Counts the stall cycles due to the event that no
                      instructions are issued because the I-queue is empty due to a Return
                      Address Stack misprediction.


1. See Section 11.4.3.1, “Dispatch Counter Considerations,” for important information.


11.4.4 R-stage Stall Counts


Stalls are caused by dependency checks (data not ready for use by the instruction ready for
dispatch) and by resources not being available (out-of-pipeline execution units needed, but
in use).


The counters in TABLE 11-7 count the stall cycles at the R-stage of the pipeline. Stalls are
counted for each clock at which the associated condition is true.


TABLE 11-7 Counters for R-stage Stalls

Counter         Description

Rstall_storeQ   [PICL 00.0101] Counts R-stage stall cycles for a store instruction which is
                the next instruction to be executed, but is stalled due to the store queue
                being full, that is, cannot hold additional stores. Up to eight entries can be
                in the store queue.

Rstall_FP_use   [PICU 00.1011] Counts R-stage stall cycles due to the event that the next
                instruction to be executed depends on the result of a preceding floating-point
                instruction in the pipeline that is not yet available.

Rstall_IU_use   [PICL 00.0110] Counts R-stage stall cycles due to the event that the next
                instruction to be executed depends on the result of a preceding integer
                instruction in the pipeline that is not yet available.


11.4.5 Recirculation Stall Counts


Recirculation instrumentation is implemented through the counters listed in TABLE 11-8.


TABLE 11-8 Counters for Recirculation


Counter          Description

Re_DC_missovhd   [PICU 00.0100] Counts the stall cycles from when a D-cache load
                 misses (causes a recirculation), but L2-cache hit/miss has not been reported.
                 Counts portion/overhead of stall cycles due to D-cache load miss from the
                 point the load reaches D-stage (about to be recirculated) to the point
                 L2-cache hit/miss for the load is reported.

Re_endian_miss   [NA] Event counter does not exist in the UltraSPARC IIIi processor.

Re_RAW_miss      [PICU 10.0110] Counts stall cycles due to recirculation when there is a
                 load in the E-stage which has a non-bypassable read-after-write (RAW) hazard
                 with an earlier store instruction. This condition means that load data are
                 being delayed by completion of an earlier store. See Section 8.12, “Read
                 After Write (RAW) Bypassing,” for a description of the RAW hazard
                 and causes of recirculation.

Re_FPU_bypass    [PICU 00.0101] Counts stall cycles due to recirculation when a FPU
                 bypass condition that does not have a direct bypass path occurs.

Re_DC_miss       [PICU 00.0110] Counts stall cycles due to loads that miss D-cache and
                 L2-cache and get recirculated. Includes cacheable loads only.

Re_EC_miss       [PICU 00.0111] Counts stall cycles due to loads that miss D-cache and
                 L2-cache and get recirculated. Stall cycles from the point when L2-cache miss
                 is detected to the D-stage of the recirculated flow are counted. Includes
                 cacheable loads only.

Re_PC_miss       [PICU 01.0000] Counts stall cycles due to recirculation when a P-cache
                 miss occurs on a prefetch predicted second load.

1. See Section 11.5.6, “Separating D-cache Stall Cycle Counts.”





11.5 Cache Access Counters 


Instruction cache, data cache, prefetch cache, write cache, and L2-cache access events can be 
collected through the counters listed in TABLE 11-9. Counts are updated by each cache access, 
regardless of whether the access will be used. 


11.5.1 Instruction Cache Events 


TABLE 11-9 Counters for Instruction Cache Events 


Counter             Description

IC_ref              [PICL 00.1000] Counts I-cache references. I-cache references are
                    fetches (up to four instructions) from an aligned block of eight
                    instructions. I-cache references are generally speculative and include
                    instructions that are later cancelled due to mis-speculation.

IC_miss             [PICU 00.1000] Counts I-cache misses. Includes fetches from
                    mis-speculated execution paths which are later cancelled.

IC_miss_cancelled   [PICU 00.0011] Counts I-cache misses cancelled due to
                    mis-speculation, recycle, or other events.

ITLB_miss           [PICU 01.0001] Counts I-TLB miss traps taken.


11.5.2 Data Cache Events


TABLE 11-10 describes the counters for D-cache events.


TABLE 11-10 Counters for Data Cache Events

Counter      Description

DC_rd        [PICL 00.1001] Counts D-cache read references (including
             accesses that subsequently trap). References to pages that are not
             virtually cacheable (TTE CV bit = 0) are not counted.

DC_rd_miss   [PICU 00.1001] Counts recirculated loads that miss the D-cache.
             Includes cacheable loads only.

DC_wr        [PICL 00.1010] Counts D-cache cacheable store accesses
             encountered (including cacheable stores that subsequently trap).
             Non-cacheable accesses are not counted.

DC_wr_miss   [PICU 00.1010] Counts D-cache cacheable store accesses that miss
             D-cache. (There is no stall or recirculation on store miss.)

DTLB_miss    [PICU 01.0010] Counts memory reference instructions which trap
             due to a D-TLB miss.








11.5.3 Write Cache Events 


TABLE 11-11 describes the counters for W-cache events. 


TABLE 11-11 Counters for Write Cache Events 




















Counter         Description

WC_miss         [PICU 01.0011] Counts W-cache misses.

WC_snoop_cb     [PICU 01.0100] Counts W-cache copybacks generated by a snoop
                from a remote processor.

WC_scrubbed     [PICU 01.0101] Counts W-cache hits to clean lines.

WC_wb_wo_read   [PICU 01.0110] Counts W-cache writebacks not requiring a read.


11.5.4 Prefetch Cache Events


TABLE 11-12 describes the counters for P-cache events. 


TABLE 11-12 Counters for Prefetch Cache Events 









































Counter        Description

PC_MS_miss     [PICU 01.1111] Counts FP loads through the MS pipeline that miss
               P-cache.

PC_soft_hit    [PICU 01.1000] Counts FP loads that hit a P-cache line that was
               prefetched by a software-prefetch instruction.

PC_hard_hit    [PICU 01.1010] Counts FP loads that hit a P-cache line that was
               prefetched by a hardware prefetch.

PC_snoop_inv   [PICU 01.1001] Counts P-cache invalidates generated by a snoop
               from a remote processor and stores by a local processor.

PC_port0_rd    [PICL 01.0000] Counts P-cache cacheable FP loads to the first port
               (general-purpose load path to D-cache and P-cache via MS pipeline).

PC_port1_rd    [PICU 01.1011] Counts P-cache cacheable FP loads to the second
               port (memory and out-of-pipeline instruction execution loads via the A0
               and A1 pipelines).








11.5.5 L2-Cache Events 


The L2-cache write hit count is determined by subtraction of the read hit and the instruction 
hit count from the total L2-cache hit count. The L2-cache write reference count is determined 
by subtraction of the D-cache read miss and I-cache misses from the total L2-cache 
references. Because of write caching, this is not the same as D-cache write misses. 
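
The two subtractions described above can be sketched in a few lines of Python. This is only an illustration: the mapping of "read hit" to DC_rd_miss minus EC_rd_miss, of "instruction hit" to IC_miss minus EC_ic_miss, and of "total hit" to EC_ref minus EC_misses is an assumption made for this sketch, and the function name is hypothetical.

```python
def l2_write_counts(ec_ref, ec_misses, dc_rd_miss, ic_miss, ec_rd_miss, ec_ic_miss):
    """Derive L2-cache write hit and write reference counts by subtraction.

    All arguments are raw event counts read from the PIC registers.
    The hit formulas below are assumptions made for this sketch.
    """
    total_hits = ec_ref - ec_misses          # all L2-cache hits (assumed)
    read_hits = dc_rd_miss - ec_rd_miss      # D-cache read misses that hit L2 (assumed)
    instr_hits = ic_miss - ec_ic_miss        # I-cache misses that hit L2 (assumed)

    # Write hits: total hits minus read hits minus instruction hits.
    write_hits = total_hits - read_hits - instr_hits
    # Write references: total references minus D-cache read misses and I-cache misses.
    write_refs = ec_ref - dc_rd_miss - ic_miss
    return write_hits, write_refs


if __name__ == "__main__":
    print(l2_write_counts(1000, 100, 600, 200, 50, 20))
```

Because EC_ref also includes speculative D-cache load requests that turn out to be D-cache hits, results computed this way should be read as estimates.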

TABLE 11-13 describes the counter for L2-cache events. 


Note — A block load or store access is counted as 8 references. For atomics, the read and 
write events are counted individually. 








TABLE 11-13 Counters for L2-cache Events 


Counter Description 


EC_ref [PICL 00.1100] Counts L2-cache reference events. A 64-byte
request is counted as one reference. Includes speculative D-cache load
requests that turn out to be a D-cache hit. Count includes cacheable
accesses only.

EC_misses [PICU 00.1100] Counts L2-cache miss events sent to the System
Interface Unit. Includes I-cache, D-cache, P-cache, W-cache exclusive
(store), read stream (BLD), and write stream (BST) requests that miss
L2-cache. Count includes cacheable accesses only.

EC_write_hit_RDO [PICL 00.1101] Counts W-cache exclusive requests that hit
L2-cache in S or O state and thus do a read-to-own (RDO) bus
transaction.

EC_wb [PICU 00.1101] Counts dirty subblocks that produce writebacks
due to L2-cache miss events.

EC_snoop_inv [PICL 00.1110] Counts L2-cache invalidates generated from a
snoop by a remote processor.

EC_snoop_cb [PICU 00.1110] Counts L2-cache copybacks generated from a
snoop by a remote processor.

EC_rd_miss [PICL 00.1111] Counts L2-cache miss events (including atomics)
from D-cache requests. Cacheable D-cache loads only.

EC_ic_miss [PICU 00.1111] Counts L2-cache read misses from I-cache
requests. The counter counts all I-cache misses including those for
instructions from the mis-speculated execution path. Cacheable requests
only.











11.5.6 Separating D-cache Stall Cycle Counts 


The D-cache stall cycle counts can be measured separately for L2-cache hits and misses by
using the Re_DC_missovhd counter. The Re_DC_missovhd stall cycle counter is used with
the recirculation and cache access events to separately calculate the stall cycles for
D-cache loads that hit and miss the L2-cache. TABLE 11-14 describes the Re_DC_missovhd
stall cycle counter processor compatibility.


TABLE 11-14 Re_DC_missovhd Stall Cycle Counter Processor Compatibility





Function Description 





The Re_DC_missovhd stall cycle counter is defined in
TABLE 11-8 and in the equations below.











Synthesizing Individual Hit and Miss Stall Times 


To explain the synthesis for L2-cache hit and miss stall times separately, consider the four
stall regions A, B, C, and D shown in FIGURE 11-5 and the definitions and calculations that
follow.

[Figure: timeline of a D-cache miss to L2-cache. Events in order: D-cache load miss at
D pipeline stage; L2-cache hit/miss is reported; recirculated load reaches D pipeline stage
again. L2-cache hit: regions A and B. L2-cache miss: regions C and D. Axis: stall time
(clock cycles).]





FIGURE 11-5 D-Cache Load Miss Stall Regions 


Definitions:
Re_DC_missovhd (stall cycles) = (A + C) stall cycles
Re_EC_miss (stall cycles) = (D) stall cycles
Re_DC_miss (stall cycles) = (A + B + C + D) stall cycles

Miss L2 Ratio = fraction of D-cache misses that miss L2-cache
              = EC_rd_miss (events) / DC_rd_miss (events)


Synthesized Stall Cycle Counts:
(C) Stall Cycles = Re_DC_missovhd * Miss L2 Ratio
L2-cache Miss Stall Cycles = (C + D) = (C) + Re_EC_miss
L2-cache Hit Stall Cycles = (A + B) = Re_DC_miss - (C + D)
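
As a worked illustration, the synthesis above can be written as a small Python function. The function and argument names are hypothetical; the arithmetic follows the three equations directly, and returning zeros when DC_rd_miss is zero is a choice made for this sketch only.

```python
def synthesize_stalls(re_dc_missovhd, re_ec_miss, re_dc_miss, ec_rd_miss, dc_rd_miss):
    """Split D-cache load miss stall cycles into L2-cache hit and miss parts.

    Returns (l2_hit_stall_cycles, l2_miss_stall_cycles).
    """
    if dc_rd_miss == 0:
        # No D-cache read misses were counted, so there is nothing to split.
        return 0.0, 0.0

    miss_l2_ratio = ec_rd_miss / dc_rd_miss      # fraction of D-cache misses that miss L2
    c_cycles = re_dc_missovhd * miss_l2_ratio    # region C
    l2_miss_cycles = c_cycles + re_ec_miss       # regions C + D
    l2_hit_cycles = re_dc_miss - l2_miss_cycles  # regions A + B
    return l2_hit_cycles, l2_miss_cycles


if __name__ == "__main__":
    hit, miss = synthesize_stalls(400, 3000, 5000, 25, 100)
    print(hit, miss)
```

The split of Re_DC_missovhd between regions A and C is proportional to the event ratio, so the result is an estimate rather than an exact measurement.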
11.6 Memory Controller Counters


This section describes the memory controller counters in the UltraSPARC IIIi processor.
Descriptions of the counters for the UltraSPARC IIIi processor memory controller are shown in
TABLE 11-15.


TABLE 11-15 Memory Controller Counters 





Counter Description 





C_read_dispatched [PICL 10.0000] Counts the number of DDR
64-byte reads dispatched by the MIU.

C_write_dispatched [PICL 10.0001] Counts the number of DDR
64-byte writes dispatched by the MIU.

C_read_returned_to_JBU [PICL 10.0010] Counts the number of 64-byte
reads that return data to JBU.

C_msl_busy_stall [PICL 10.0011] Counts the number of stall
cycles due to msl busy.

C_mdb_overflow_stall [PICL 10.0100] Counts the number of stall
cycles due to potential memory data buffer overflow.

C_miu_spec_request [PICL 10.0101] Counts the number of
speculative requests accepted by MIU.




















MC_open_bank_cmds [PICU 10.0000] Counts the number of open
bank commands sent to the DDR SDRAM. With the
PTB enabled in the MCU, this counts PTB misses (no entry in
the PTB).

MC_reads [PICU 10.0001] Counts the number of DDR
64-byte reads by the MSL.

MC_writes [PICU 10.0010] Counts the number of DDR
64-byte writes by the MSL.

MC_page_close_stall [PICU 10.0011] Counts the number of DDR
page conflicts. When there is already a Page
Tracking Buffer (PTB) entry, and a different page in
the same bank needs to be opened, a page close is
needed before opening a new page. Always zero
when PTB is disabled.


11.7 Miscellaneous Counters


11.7.1 System Interface Events and Clock Cycles


System interface statistics are collected through the counters listed in TABLE 11-16. 


TABLE 11-16 Counters for System Interface Statistics 











Counter Description 

SI_snoop [PICL 01.0001] Counts snoops from remote processor(s) including RDS,
RDO.

SI_ciq_flow [PICL 01.0010] Counts system clock cycles when the flow control
(DOK/AOK) is asserted from this processor.

SI_owned [PICL 01.0011] Counts the number of times J_PACK indicating OWNED
is asserted on requests.











11.7.2 Software Events 


Software statistics are collected through the counters listed in TABLE 11-17. 


TABLE 11-17 Counters for Software Statistics 





Counter Description 





SW_count0 [PICL 01.0100] Counts software-generated occurrences of sethi
%hi(0xfc000), %g0 instruction.

SW_count1 [PICU 01.1100] Counts software-generated occurrences of sethi
%hi(0xfc000), %g0 instruction.














Note: Both counters measure the same event; thus, the count can be programmed to be
read from either the PICL or the PICU register.


11.7.3 Floating-Point Operation Events


Floating-point operation statistics are collected through the counters listed in TABLE 11-18. 


TABLE 11-18 Counters for Floating-Point Operation Statistics 





Event Counter Description 





FA_pipe_completion [PICL 01.1000] Counts instructions that complete execution on the
Floating-Point/Graphics ALU pipelines.

FM_pipe_completion [PICU 10.0111] Counts instructions that complete execution on the
Floating-Point/Graphics Multiply pipelines.














11.8 PCR.SL and PCR.SU Encodings 


TABLE 11-19 lists the PCR.SL and PCR.SU selection bit field encodings. Shaded blocks show SL
and SU field duplications.


TABLE 11-19 PIC.SL and PIC.SU Selection Bit Field Encoding 






























































PCR.SL and PCR.SU
Encodings           PICL Event Selection       PICU Event Selection
00.0000             Cycle_cnt                  Cycle_cnt
00.0001             Instr_cnt                  Instr_cnt
00.0010             Dispatch0_IC_miss          Dispatch0_mispred
00.0011             Dispatch0_br_target        IC_miss_cancelled
00.0100             Dispatch0_2nd_br           Re_DC_missovhd
00.0101             Rstall_storeQ              Re_FPU_bypass
00.0110             Rstall_IU_use              Re_DC_miss
00.0111             Reserved                   Re_EC_miss
00.1000             IC_ref                     IC_miss
00.1001             DC_rd                      DC_rd_miss
00.1010             DC_wr                      DC_wr_miss
00.1011             Reserved                   Rstall_FP_use
00.1100             EC_ref                     EC_misses
00.1101             EC_write_hit_RDO           EC_wb
00.1110             EC_snoop_inv               EC_snoop_cb
00.1111             EC_rd_miss                 EC_ic_miss
01.0000             PC_port0_rd                Re_PC_miss
01.0001             SI_snoop                   ITLB_miss
01.0010             SI_ciq_flow                DTLB_miss
01.0011             SI_owned                   WC_miss
01.0100             SW_count0                  WC_snoop_cb
01.0101             IU_Stat_Br_miss_taken      WC_scrubbed
01.0110             IU_Stat_Br_count_taken     WC_wb_wo_read
01.0111             Dispatch_rs_mispred        Reserved
01.1000             FA_pipe_completion         PC_soft_hit
01.1001             Reserved                   PC_snoop_inv
01.1010             Reserved                   PC_hard_hit
01.1011             Reserved                   PC_port1_rd
01.1100             Reserved                   SW_count1
01.1101             Reserved                   IU_Stat_Br_miss_untaken
01.1110             Reserved                   IU_Stat_Br_count_untaken
01.1111             Reserved                   PC_MS_miss
10.0000             C_read_dispatched          MC_open_bank_cmds
10.0001             C_write_dispatched         MC_reads
10.0010             C_read_returned_to_JBU     MC_writes
10.0011             C_msl_busy_stall           MC_page_close_stall
10.0100             C_mdb_overflow_stall       Reserved
10.0101             C_miu_spec_request         Reserved
10.0110             Reserved                   Re_RAW_miss
10.0111             Reserved                   FM_pipe_completion
10.1000             Reserved                   Reserved
10.1001             Reserved                   Reserved
10.1010 - 11.1111   Reserved                   Reserved
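
The encodings in TABLE 11-19 are written as six binary digits with a dot after the second digit. The following Python sketch converts between that notation and the integer value of the field. It assumes the dot is purely a visual separator; the helper names are hypothetical, and the sketch does not show how the value is placed into the PCR.

```python
def parse_selector(text):
    """Convert an encoding such as '01.0011' to its 6-bit integer value."""
    bits = text.replace(".", "")
    if len(bits) != 6 or any(ch not in "01" for ch in bits):
        raise ValueError("expected six binary digits, got %r" % text)
    return int(bits, 2)


def format_selector(value):
    """Convert a 6-bit integer back to the dotted notation used in the table."""
    if not 0 <= value <= 0x3F:
        raise ValueError("selector must fit in six bits")
    bits = format(value, "06b")
    return bits[:2] + "." + bits[2:]


if __name__ == "__main__":
    # WC_miss is listed under PICU at encoding 01.0011.
    print(parse_selector("01.0011"), format_selector(0x27))
```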
CHAPTER 12


Reset and RED_state


The UltraSPARC IIIi processor can be reset using various mechanisms. This chapter deals
with the reset and RED_state for the UltraSPARC IIIi processor.





12.1 RED_state Characteristics


A processor enters RED_state in one of two ways:

Trapping when already at the maximum trap level

Setting the PSTATE.RED bit

When the processor enters RED_state, it clears the DCU Control Register, including the
enable bits for I-cache, D-cache, I-MMU, D-MMU, and virtual and physical watchpoints.














Note: Exiting RED_state by writing zero to PSTATE.RED in the delay slot of a JMPL
is not recommended. A non-cacheable instruction prefetch can be made to the JMPL target,
which may be in a cacheable memory area. This condition could result in a bus error on
some systems and cause an instruction_access_error trap. The trap can be masked by setting
the NCEEN bit in the ESTATE_ERR_EN register to zero, but this approach will mask all
non-correctable error checking. Exiting RED_state with DONE or RETRY avoids the
problem.
























































12.2 Resets 


Reset priorities from highest to lowest are Power-On Reset (POR), System Reset, Externally 
Initiated Reset (XIR), Watchdog Reset (WDR), and Software-Initiated Reset (SIR). 
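
The priority ordering can be illustrated with a short Python sketch that picks the highest-priority reset from a set of pending ones. The function name and the string labels are hypothetical and exist only for this illustration; the sketch models the ordering, not the hardware arbitration itself.

```python
# Reset types listed from highest to lowest priority, as stated above.
RESET_PRIORITY = ("POR", "System Reset", "XIR", "WDR", "SIR")


def highest_priority_reset(pending):
    """Return the highest-priority reset in `pending`, or None if it is empty."""
    for name in RESET_PRIORITY:
        if name in pending:
            return name
    return None


if __name__ == "__main__":
    print(highest_priority_reset({"SIR", "XIR"}))
```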
12.2.1 Power-On Reset


A Power-On Reset (POR) occurs when the J_POR_L and J_RST_L pins are activated and stay
asserted until the processor is within its specified operating range. During POR, all other
resets and traps are ignored. POR has a trap type of 1 at physical address offset 0x20. Any
pending external transactions are canceled.





After POR, software must initialize the values of certain registers and state that are unknown
after POR. The following must be initialized before the caches are enabled:

In the I-cache, valid bits must be cleared and microtag bits must be set so that each way
within a set has a unique microtag value.

In the D-cache, valid bits must be cleared and microtag bits must be set so that each way
within a set has a unique microtag value.

All L2-cache tags and data.

The I-MMU and D-MMU TLBs must also be initialized.

The P-cache valid bits must be initialized before any floating-point loads are executed.











Caution: Executing a DONE or RETRY instruction when TSTATE is uninitialized after a
POR can damage the chip. The POR boot code should initialize TSTATE<3:0>, using wrpr
writes, before any DONE or RETRY instructions are executed.

However, these operations can only be executed in privileged mode. Therefore, user code is
not at risk of damaging the chip.





12.2.2 System Reset


A System Reset occurs when the J_RST_L pin is activated without J_POR_L. When this pin
is active, all other resets and traps are ignored. System Reset has a trap type of 1 at physical
address offset 0x20. Any pending external transactions are canceled.


After a System Reset, software must treat the following state as unknown and initialize it.
In particular, the valid and microtag bits in the I-cache, the valid and microtag bits in the
D-cache, and all L2-cache tags and data must be cleared before enabling the caches.
The I-MMU and D-MMU TLBs must also be initialized.


Memory refresh continues uninterrupted during a System Reset. The system interface, L2-cache
configuration, and memory controller configuration are preserved across a System Reset.


The JBUS clock ratio is unaffected during this reset. Clock PLLs are reset during a
Power-On Reset, but not during a System Reset unless the appropriate bit in the CSR is set
before the System Reset.


There are bits in JIO that software can write to cause a System Reset or Power-On Reset at
any time. CSRs on the UltraSPARC IIIi processor that change clock ratios generally do not
take effect until a System Reset.


12.2.3 Externally Initiated Reset (XIR)


An Externally Initiated Reset (XIR) is sent to all processors through the XIR transaction on
the JBUS. It causes an XIR as defined in SPARC-V9, which has a trap type 0x3 at physical
address offset 0x60. It has higher priority than all other resets except Power-On Reset and
System Reset.


This reset (actually a trap) only affects the processors, rather than the entire system. Memory
state, cache state, and most CSR states remain unchanged.


The saved PC and nPC will only be approximations since the trap is not precise with respect
to pipeline state.


Reset due to XIR for the UltraSPARC IIIi processor initiates fetch of instruction code from
Boot PROM, and the memory controller continues to perform refresh cycles in order to
preserve main memory contents.


12.2.4 Watchdog Reset (WDR) and error_state


The processor enters error_state when a trap occurs at TL = MAXTL.


The processor automatically exits error_state using WDR. The processor signals itself
internally to take a WDR and sets TT = 2. The WDR traps to the address at
RSTVaddr + 0x40. WDR sets the processor in a state where it is prepared for diagnosis of
failures.


WDR affects only one processor, rather than the entire system. CWP updates due to window
traps that cause watchdog traps are the same as the no watchdog trap case.


12.2.5 Software-Initiated Reset (SIR)


A Software-Initiated Reset (SIR) is initiated by an SIR instruction within any processor. This
per-processor reset has a trap type 4 at physical address offset 0x80. SIR affects only one
processor, rather than the entire system.


12.3 RED_state Trap Vector


When a SPARC-V9 processor processes a reset or trap that enters RED_state, it takes a trap 
at an offset relative to the RED_state_trap_vector base address (RSTVaddr). The trap offset 
depends on the type of RED mode trap and takes the values: 


POR 0x20 
WDR 0x40 
XIR 0x60 
SIR 0x80 
Other 0xA0 


In the UltraSPARC IIIi processor, the following is the RSTV base address:
Virtual Address: 0xFFFF FFFF F000 0000
Physical Address, PA[42:0]: 0x7FF F000 0000
The UltraSPARC IIIi processor has an RMTV pin to select a second RSTV to allow use of PC
compatible SuperIO chips on a PCI bus. The following is the second RSTV base address:
Virtual Address: 0xFFFF FFFF FFFF 0000
Physical Address, PA[42:0]: 0x7FF FFFF 0000
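
As a worked example, the trap target can be computed by combining a base address with a trap offset. The Python sketch below uses the base addresses and offsets quoted above. It assumes that an RMTV pin value of 0 selects the first base and 1 selects the second, and that a System Reset shares the POR offset (both are described elsewhere in this chapter as using offset 0x20). The names are hypothetical.

```python
# RSTV base addresses, keyed by the assumed RMTV pin value.
RSTV_BASE_VA = {0: 0xFFFF_FFFF_F000_0000, 1: 0xFFFF_FFFF_FFFF_0000}
RSTV_BASE_PA = {0: 0x7FF_F000_0000, 1: 0x7FF_FFFF_0000}

# Offsets relative to RSTVaddr for each kind of RED_state trap.
TRAP_OFFSET = {
    "POR": 0x20,
    "SYSTEM_RESET": 0x20,  # assumed to share the POR entry point
    "WDR": 0x40,
    "XIR": 0x60,
    "SIR": 0x80,
    "OTHER": 0xA0,
}


def red_state_vector(kind, rmtv=0):
    """Return (virtual_address, physical_address) of the RED_state trap entry."""
    offset = TRAP_OFFSET[kind]
    return RSTV_BASE_VA[rmtv] | offset, RSTV_BASE_PA[rmtv] | offset


if __name__ == "__main__":
    va, pa = red_state_vector("XIR")
    print(hex(va), hex(pa))
```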





12.4 Initialization and Use of the Return Address Stack


The need to initialize the various L1-cache and L2-cache states, and MMU states, is well 
understood, but in the past the need to initialize other caching devices has been overlooked. 
The Return Address Stack (RAS) is one such device. While it is initialized to zero when 
RED mode is entered, zeroes may not be an appropriate PA or VA. 


Failure Scenario 


With the I-MMU off, the RAS can be used to generate a predicted physical address for
prefetch. However, the RAS may have a virtual address in it, from execution while the
I-MMU was enabled. This virtual address is used as is for instruction prefetch and may cause
side-effects at whatever destination it indicates, or other errors.

The UltraSPARC IIIi processor uses the RAS for prediction for CALL, RETURN, DONE, and
RETRY. The UltraSPARC IIIi processor considers RETURN to be a JMPL with an %rs1
equal to %o7 (normal subroutine) or %i7 (leaf subroutine).





There are possibly other cases that use the RAS for prefetch, for instance, immediately after
writing to the LSU control register to enable the I-MMU.


The issue also exists whenever software turns off the I-MMU after executing for a while with 
the I-MMU enabled. This should only happen due to traps to RED mode, for normal 
software. There is no problem for the transition of I-MMU off to on, because I-MMU will 
block the prefetch address if it is an I-MMU miss, and it will get flushed away when the 
prediction is determined to be wrong. 


Software Rules 


After any reset, trap to RED mode, or transition of the I-MMU from on to off, the 8-level
RAS should be initialized with eight CALL instructions to a valid non-cacheable address
before PSTATE.RED turns off. If the I-MMU is enabled before PSTATE.RED turns off,
there may be no issue: if VA == 0x0 is unmapped, the prefetch will be
disabled.


























The output of the RAS is forced to the Red Mode Trap Vector (RMTV) while
PSTATE.RED == 1. However, the RAS is initialized to zeroes, so when PSTATE.RED turns
off, the zeroes are used for prediction, and may not be valid addresses (cacheable or
non-cacheable).





























12.5 Machine States 


TABLE 12-1 shows the machine state created as a result of any reset, or after entering
RED_state.

TABLE 12-1 Machine State After Reset and in RED_state (1 of 5)

Name                       Fields        Power-On Reset               System Reset                 WDR, XIR, SIR, RED_state†
Integer Registers                        Unknown                      Unchanged                    Unchanged
Floating-Point Registers                 Unknown                      Unchanged                    Unchanged
L2-Cache Control Register  EC_MOSI       1                            1                            Unchanged
                           EC_Pwr_Up     0                            0
                           EC_Act_Way**  0                            0
                           EC_Block      0                            0
                           EC_size**     0                            0
                           EC_par_En     0                            0
                           EC_ECC_en     0                            0
                           EC_ECC_force  0                            0
                           EC_check      0                            0
RSTV Value                               If processor pin rmtv = 0: VA = 0xffff ffff f000 0000, PA = 0x7ff f000 0000;
                                         else: VA = 0xffff ffff ffff 0000, PA = 0x7ff ffff 0000.
PC                                       RSTV | 0x20                  RSTV | 0x20                  WDR: RSTV | 0x40; XIR: RSTV | 0x60; SIR: RSTV | 0x80; RED_state: RSTV | 0xa0
nPC                                      RSTV | 0x24                  RSTV | 0x24                  WDR: RSTV | 0x44; XIR: RSTV | 0x64; SIR: RSTV | 0x84; RED_state: RSTV | 0xa4
PSTATE                     MM            0 (TSO)                      0 (TSO)                      0 (TSO)
                           RED           1 (RED_state)                1 (RED_state)                1 (RED_state)
                           PEF           1 (FPU on)                   1 (FPU on)                   1 (FPU on)
                           AM            0 (Full 64-bit address)      0 (Full 64-bit address)      0 (Full 64-bit address)
                           PRIV          1 (Privileged mode)          1 (Privileged mode)          1 (Privileged mode)
                           IE            0 (Disable interrupts)       0 (Disable interrupts)       0 (Disable interrupts)
                           AG            1 (Alternate globals         1 (Alternate globals         1 (Alternate globals
                                         selected)                    selected)                    selected)
                           CLE           0 (Current little-endian)    0 (Current little-endian)    PSTATE.TLE
                           TLE           0 (Trap little-endian)       0 (Trap little-endian)       Unchanged
                           IG            0 (Interrupt globals not     0 (Interrupt globals not     0 (Interrupt globals not
                                         selected)                    selected)                    selected)
                           MG            0 (MMU globals not           0 (MMU globals not           0 (MMU globals not
                                         selected)                    selected)                    selected)
TBA<63:15>                               Unknown                      Unchanged                    Unchanged
Y                                        Unknown                      Unchanged                    Unchanged
PIL                                      Unknown                      Unchanged                    Unchanged

TABLE 12-1 Machine State After Reset and in RED_state (2 of 5)

Name   Fields   Power-On Reset   System Reset   WDR   XIR   SIR   RED_state†
CWP Unknown Unchanged Unchanged except for register window traps 
TT[TL] Unchanged |3 4 Trap type 
CCR Unknown Unchanged Unchanged 
ASI Unknown Unchanged Unchanged 
TL MAXTL MAXTL min(TL+1, MAXTL) 
TPC[TL] Unknown Unchanged PC PC & ~0x1f PC
TNPC[TL] Unknown Unchanged nPC nPC=PC+4 | nPC 
TSTATE CCR Unknown Unchanged CCR 

ASI Unknown Unchanged ASI 

PSTATE Unknown Unchanged PSTATE 

CWP Unknown Unchanged CWP 

PC Unknown Unchanged PC 

nPC Unknown Unchanged nPC 
TICK NPT 1 1 Unchanged | Unchanged | Unchanged 

counter Restart at 0 | Restart at 0 Count Restart at 0 | Count
CANSAVE Unknown Unchanged Unchanged 
CANRESTORE Unknown Unchanged Unchanged 
OTHERWIN Unknown Unchanged Unchanged 
CLEANWIN Unknown Unchanged Unchanged 
WSTATE OTHER Unknown Unchanged Unchanged 

NORMAL Unknown Unchanged Unchanged 
VER MANUF 0x003E 

IMPL 0x0016 

MASK mask dependent 

MAXTL 5 

MAXWIN 7 
FSR All 0 0 Unchanged 
FPRS All Unknown Unchanged Unchanged 
Non-SPARC-V9 ASRs 
SOFTINT Unknown Unchanged Unchanged 
TICK_COMPARE INT_DIS 1 (off) 1 (off) Unchanged 

TICK_CMPR 0 0 Unchanged 
STICK NPT 1 1 Unchanged 

counter 0 0 Count 
STICK_COMPARE INT_DIS 1 (off) 1 (off) Unchanged 

TICK_CMPR 0 0 Unchanged

TABLE 12-1 Machine State After Reset and in RED_state (3 of 5)

Name                Fields             Power-On Reset   System Reset   WDR, XIR, SIR, RED_state†
PERF_CONTROL        S1                 Unknown          Unchanged      Unchanged
                    S0                 Unknown          Unchanged      Unchanged
                    UT (trace user)    Unknown          Unchanged      Unchanged
                    ST (trace system)  Unknown          Unchanged      Unchanged
                    PRIV (priv access) Unknown          Unchanged      Unchanged
PERF_COUNTER        All                Unknown          Unknown        Unknown
GSR                 IM                 0                0              Unchanged
                    Others             Unknown          Unchanged      Unchanged
DISPATCH_CONTROL    MS                 0                0              Unchanged
                    SI                 0                0              Unchanged
                    RPE                0                0              Unchanged
                    BPE                0                0              Unchanged
                    OBS                0                0              Unchanged
                    IFPOE              0                0              Unchanged
Non-SPARC-V9 ASIs
DCU_CONTROL         WE                 0 (off)          0 (off)        Unchanged
                    All others         0 (off)          0 (off)        0 (off)
INST_BREAKPOINT     All                0 (off)          0 (off)        Unchanged
VA_WATCHPOINT                          Unknown          Unchanged      Unchanged
PA_WATCHPOINT                          Unknown          Unchanged      Unchanged
I- & D-MMU_SFSR     ASI                Unknown          Unchanged      Unchanged
                    FT                 Unknown          Unchanged      Unchanged
                    E                  Unknown          Unchanged      Unchanged
                    CTXT               Unknown          Unchanged      Unchanged
                    PRIV               Unknown          Unchanged      Unchanged
                    W                  Unknown          Unchanged      Unchanged
                    OW (overwrite)     Unknown          Unchanged      Unchanged
                    FV (SFSR valid)    0                0              Unchanged
                    NF                 Unknown          Unchanged      Unchanged
                    TM                 Unknown          Unchanged      Unchanged
DMMU_SFAR                              Unknown          Unchanged      Unchanged
INTR_DISPATCH       All                0                0              Unchanged
INTR_RECEIVE        BUSY               0                Unchanged
                    SOURCE             Unknown          Unchanged      Unchanged
ESTATE_ERR_EN       All                0 (All off)      0 (All off)    Unchanged
AFAR                PA                 Unknown          Unchanged      Unchanged
AFSR                All                0                Unchanged      Unchanged

TABLE 12-1 Machine State After Reset and in RED_state (4 of 5)




















Name   Fields   Power-On Reset   System Reset   WDR, XIR, SIR, RED_state†
MCU CTL REGI CIk Update Unknown 0 Unchanged 
CIK Stop Unknown 0 Unchanged 
30 Unknown 0 Unchanged 
Remaining bits | 0 Unchanged Unchanged 
MCU CTL REG2 CLK 2 effect Unchanged 
propagated 
PLL2 MI y effect Unchanged 
propagated 
PLL2 M2 3 effect Unchanged 
propagated 
Remaining bits | 0 Unchanged Unchanged 
MCU_CTL_REG3 All Unknown Unchanged Unchanged 
JBUS_CONFIG PAR_DLY 0 effect Unchanged 
propagated 
PORT_LOCN Ox7f effect Unchanged 
propagated 
PORT_PRES J_PACK6- unchanged Unchanged 
0<2:0> 
DBG2 Oxf effect Unchanged 
propagated 
DTL { DOWN_25, | unchanged Unchanged 
UP. OPEN] 
MID Ox3e unchanged Unchanged 
MR 0 unchanged Unchanged 
MT 0 unchanged Unchanged 
AID {[4:3],[2:0]} | (00,J. ID effect Unchanged 
<2:0>} propagated 
SW_JERR 0 0 Unchanged 
E* CLK 0 unchanged Unchanged 
SRT 0 effect Unchanged 
propagated 
TOF 0 effect Unchanged 
propagated 
TOV 0 effect Unchanged 
propagated 
DBGI 0x7 effect Unchanged 
propagated 
CLK 0 effect Unchanged 
propagated 
ARB_MODE 0 effect 
propagated 
JP_IMP_CTL0 All Varies Varies Varies
JP_IMP_CTL1 All 0 Unchanged Unchanged

TABLE 12-1 Machine State After Reset and in RED_state (5 of 5)

Name                          Fields                        Power-On Reset   System Reset   WDR, XIR, SIR, RED_state†
JP_IMP_CTL2                   [63:8]                        0                0              Unchanged
                              [7:0]                         0                Unchanged
Other Processor-Specific States
Processor L2-Cache Tags, Micro-tags and Data                Unknown          Unknown        Unchanged
(Includes Data, Instruction, Prefetch, and Write Caches)
Cache Snooping                                              Enabled
Instruction Queue                                           Empty
Store Queue                                                 Empty            Empty          Unchanged
I-TLB, D-TLB                  Mappings, Valid, Lock,        Unknown          Unknown        Unchanged
                              E-bit, NC-bit, Global bit,
                              etc.

* This register is read-only from the system.

† Processor states are only updated according to the following table if RED_state is entered due to a reset or a trap. If RED_state is entered
because the PSTATE.RED bit was explicitly set to 1, then software must create the appropriate states itself.

** These bits will read as 0 after POR or System Reset, but subsequent to the first write to this register, will read as 1.

Effect propagated: Some CSRs have delayed effects after writes by software. The readable CSR is updated by the software write, and on
the next reset, the contents of a shadow register are updated from the CSR, which affects chip behavior from then on. Until the update
happens, the shadow register has the old state. If the reset event never happens, the write will never have an effect. A Hard POR initializes the
shadow register to the same state as the readable CSR.
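
The "effect propagated" behavior can be modeled in a few lines of Python. This is a sketch of the semantics described above, not of any real register interface: the class and method names are hypothetical, and the readable value established by a Hard POR is passed in as a parameter because the per-register reset values are listed in the table rather than here.

```python
class ShadowedCsr:
    """Toy model of a CSR whose writes take effect only at the next reset."""

    def __init__(self, por_value=0):
        self.hard_por(por_value)

    def hard_por(self, por_value=0):
        # A Hard POR leaves the shadow register equal to the readable CSR.
        self.readable = por_value
        self.shadow = por_value

    def write(self, value):
        # A software write updates only the readable CSR.
        self.readable = value

    def reset(self):
        # On the next reset, the shadow register is updated from the CSR.
        self.shadow = self.readable

    def effective(self):
        # Chip behavior follows the shadow register, not the readable CSR.
        return self.shadow


if __name__ == "__main__":
    csr = ShadowedCsr(0)
    csr.write(5)
    print(csr.readable, csr.effective())  # write is visible but not yet effective
    csr.reset()
    print(csr.readable, csr.effective())
```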
CHAPTER A


Instruction Definitions





Related instructions are grouped into subsections. Each subsection consists of the following
parts:


1. A table of the opcodes defined in the subsection with the values of the field(s) that
uniquely identify the instruction(s).


2. An illustration of the applicable instruction format(s). In these illustrations, a dash
indicates that the field is reserved for future versions of the architecture and shall be zero
in any instance of the instruction. If the processor encounters nonzero values in these
fields, its behavior is undefined.


3. A description of the features, restrictions, and exception-causing conditions.


4. A list of exceptions that can occur as a consequence of attempting to execute the
instruction(s). Exceptions due to an instruction_access_error,
instruction_access_exception, fast_instruction_access_MMU_miss, fast_ECC_error,
ECC_error (corrected ECC error), WDR, and interrupts are not listed because they can
occur on any instruction. Instructions not implemented in hardware shall generate an
illegal_instruction exception and therefore will not generate any of the other exceptions
listed. The illegal_instruction exception is not listed because it can occur on any
instruction that triggers an instruction breakpoint or contains an invalid field.


Instruction latencies and execution rates are provided in Chapter 4 “Instruction Execution.”


TABLE A-2 summarizes the instruction set; the instruction definitions follow the table. Within 
TABLE A-2 and throughout this chapter, certain opcodes are marked with mnemonic 
superscripts. The superscripts and their meanings are defined in TABLE A-1. 


TABLE A-1 Opcode Superscripts


Superscript   Meaning
D             Deprecated instruction
P             Privileged opcode
PASI          Privileged action if bit 7 of the referenced ASI is zero
PASR          Privileged opcode if the referenced ASR register is privileged
PNPT          Privileged action if PSTATE.PRIV = 0 and (S)TICK.NPT = 1
PPIC          Privileged action if PCR.PRIV = 1


TABLE A-2 Instruction Set (1 of 6)


V9 extension 
Name formats 





ADD, ADDcc Add (and modify condition codes)
ALIGNADDRESS{_LITTLE} Calculate address for misaligned data 3
AND, ANDcc And (and modify condition codes)
ANDN, ANDNcc And not (and modify condition codes)


Three-Dimensional array addressing instructions 3 





Branch on integer condition codes with prediction 28 
Branch on integer condition codes 42 


5 
Set the GSR.MASK field 2 3 








Branch on contents of integer register with prediction (also known 
as BRr) 





BSHUFFLE 


Permute bytes as specified by GSR. MASK 3 





CALL Call and link
CASAPASI Compare and swap word in alternate space
CASXAPASI Compare and swap doubleword in alternate space








FANDNOT(1,2)(S) 


3 
Floating-point absolute value 
Floating-point add 
Perform data alignment for misaligned data 3 


Logical AND operation 





Logical AND operation with one inverted source 33 3
FBfccD Branch on floating-point condition codes 42

TABLE A-2 Instruction Set (2 of 6)


Name 


V9 extension 


formats 





Branch on floating-point condition codes with prediction 




































































FCMPE(s,d,q) Floating-point compare (exception if unordered) 
FCMP(GT,LE,NE,EQ)(16,32) Pixel compare operations 3 
FDIV(s,d,q) Floating-point divide 
FEXPAND Pixel expansion 3
FiTO(s,d,q) Convert integer to floating-point
FLUSH Flush instruction memory
FMOV(s,d,q) Floating-point move
FMOV(s,d,q)cc Move floating-point register if condition is satisfied
FMOV(s,d,q)r Move floating-point register if integer register contents satisfy
condition 
FMUL(s,d,q) Floating-point multiply 
FMUL8x16(AU,AL) 8x16 upper/lower a partitioned product 3 
FMUL8(SU,UL)x16 8x16 upper/lower partitioned product 3 
FMULD8(SU,UL)x16 8x16 upper/lower partitioned product 3 
FNEG(s,d,q) Floating-point negate
FNOR{S} Logical NOR operation 3
FNOT(1,2){S} Copy negated source 3
3 
FOR{S} Logical OR operation 3 
FORNOT(1,2){S} Logical OR operation with one inverted source 3 
FPACK(16,32, FIX) Pixel packing 3 
FPADD(16,32){S} Pixel add (single) 16- or 32-bit 3 
Pixel merge 3 
FPSUB(16,32){S} Pixel subtract (single) 16- or 32-bit 3 
FsMULd Floating-point multiply single to double 
FSQRT(s,d,q) Floating-point square root 
3 
F(s,d,q)TOi Convert floating-point to integer 
F(s,d,q)TO(s,d,q) Convert between floating-point formats 
F(s,d,q)TOx Convert floating-point to 64-bit integer

TABLE A-2 Instruction Set (3 of 6)


V9 extension
formats


Operation Name

FSUB(s,d,q) Floating-point subtract 





























Logical XOR operation 332 3 
Convert 64-bit integer to floating-point 306 
Zero fill 332 3 
ILLTRAP Illegal instruction
JMPL Jump and link 317 
LDDD Load integer doubleword 
LDDF Load double floating-point 318 
LDDFA ASI_FL* Short floating-point loads (VIS I) 400 3 
LDF Load floating-point 


Load floating-point from alternate space 





Load floating-point state register lower 











LDSB Load signed byte 
LDSBAPASI Load signed byte from alternate space 





LDSH Load signed halfword 


LDQF Load quad floating-point
LDQFAPASI Load quad floating-point from alternate space


Load signed word
LDSWAPASI Load signed word from alternate space


LDUB Load unsigned byte
LDUBAPASI Load unsigned byte from alternate space


LDUH Load unsigned halfword


LDUHAPASI Load unsigned halfword from alternate space
LDUW Load unsigned word


LDUWAPASI Load unsigned word from alternate space














LDX Load extended 3

TABLE A-2 Instruction Set (4 of 6)


V9 extension 
Operation Name formats 





LDXAPASI Load extended from alternate space 





Load floating-point state register 





Memory barrier 


Move integer register if condition is satisfied 343 


Move integer register on contents of integer register 





Multiply step (and modify condition codes) 





Multiply 64-bit integers 








No operation 





Inclusive OR (and modify condition codes) 


Pixel component distance 3 





Population Count 





Prefetch data 





Read ancillary state register 





Read condition codes register 





Read dispatch control register 





Read floating-point registers state register 





Read graphic status register 


Read program counter 
Read performance control register 


Read performance instrumentation counters 








Read privileged register 





Read per-processor soft interrupt register 


Read TICK compare register 





Read Y register 





ESTORE Restore caller’s window 





ESTOREDP Window has been restored 


Return from trap and retry 
TABLE A-2 Instruction Set (5 of 6)

Operation Name Page V9 extension
SAVE Save caller’s window 392
SAVEDP Window has been saved 394
SDIVD, SDIVccD 32-bit signed integer divide (and modify condition codes) 428
SDIVX 64-bit signed integer divide 357
Set high 22 bits of low word of integer register 397
Set Interval Arithmetic Mode (VIS II) 395
Software-initiated reset 403
Shift left logical (IU) 398
SLLX Shift left logical, extended (IU)
SRL Shift right logical (IU) 398
SRLX Shift right logical, extended (IU)
Store byte into alternate space (IU)
Store barrier
Store doubleword
Store doubleword into alternate space
STDF Store double floating-point (FP)
STDFAPASI Store double floating-point into alternate space (FP)
STDFA ASI_BLK* Block stores
STDFA ASI_FL* Short floating-point stores (VIS I)
STDFA ASI_PST* Partial Store instructions
STF Store floating-point (FP)
STFAPASI Store floating-point into alternate space (FP)
STFSR Store floating-point state register (FP)
STH Store halfword (IU)
STHAPASI Store halfword into alternate space (IU)
STQF Store quad floating-point (FP)
STQFAPASI Store quad floating-point into alternate space (FP)
STW Store word (IU)
STWAPASI Store word into alternate space (IU)
STX Store extended (IU)
TABLE A-2 Instruction Set (6 of 6)

Operation Name Page V9 extension
STXAPASI Store extended into alternate space (IU)
STXFSR Store extended floating-point state register (MS)
SUB, SUBcc Subtract (and modify condition codes)
SUBC, SUBCcc Subtract with carry (and modify condition codes)
Swap integer register with memory 446
SWAPAD,PASI Swap integer register with memory in alternate space
TADDcc, TADDccTVD Tagged add and modify condition codes (trap on overflow)
Tcc Trap on integer condition codes
TSUBcc, TSUBccTVD Tagged subtract and modify condition codes (trap on overflow)
Write ancillary state register
Write condition codes register
Write dispatch control register
Write floating-point registers state register
WRGSR Write graphic status register
WRPCRP Write performance control register
WRPICPPIC Write performance instrumentation counters register
WRPRP Write privileged register
WRSOFTINTP Write per-processor soft interrupt register
WRSOFTINT_SETP Set bits of per-processor soft interrupt register
WRTICK_CMPRP Write TICK compare register
WRSTICKP Write System TICK register
WRYD Write Y register
XNOR, XNORcc Exclusive NOR (and modify condition codes)
A.1 Add























Opcode Op3 Operation 

ADD 00 0000 Add 

ADDcc 01 0000 Add and modify condition codes 

ADDC 00 1000 Add with Carry 

ADDCcc 01 1000 Add with Carry and modify condition codes 
Format (3)
31 30 29 25 24 19 18 14 13 12 5 4 0

















Assembly Language Syntax 

add reg_rs1, reg_or_imm, reg_rd
addcc reg_rs1, reg_or_imm, reg_rd
addc reg_rs1, reg_or_imm, reg_rd
addccc reg_rs1, reg_or_imm, reg_rd
Description 


ADD and ADDcc compute “r[rs1] + r[rs2]” if i = 0, or
“r[rs1] + sign_ext(simm13)” if i = 1, and write the sum into r[rd].


ADDC and ADDCcc (“ADD with carry”) also add the CCR register’s 32-bit carry (icc.c) bit;
that is, they compute “r[rs1] + r[rs2] + icc.c” or
“r[rs1] + sign_ext(simm13) + icc.c” and write the sum into r[rd].


ADDcc and ADDCcc modify the integer condition codes (CCR.icc and CCR.xcc).
Overflow occurs on addition if both operands have the same sign and the sign of the sum is
different.
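
The arithmetic above can be sketched as a small behavioral model. This is an illustrative Python sketch, not processor code: the names `sign_ext13` and `add_model` are hypothetical, and the negative (n) and zero (z) flags are included on the assumption that they follow the usual SPARC-V9 definitions. The carry and overflow rules are the ones stated in the text.

```python
MASK64 = (1 << 64) - 1

def sign_ext13(simm13):
    """Sign-extend a 13-bit immediate field to a Python integer."""
    simm13 &= 0x1FFF
    return simm13 - 0x2000 if simm13 & 0x1000 else simm13

def add_model(rs1, operand2, carry_in=0):
    """Model ADD/ADDC: operand2 is r[rs2] or sign_ext(simm13).

    carry_in is 0 for ADD/ADDcc and CCR.icc.c for ADDC/ADDCcc.
    Returns (r[rd], icc flags, xcc flags).
    """
    a = rs1 & MASK64
    b = operand2 & MASK64
    result = (a + b + carry_in) & MASK64

    def flags(width):
        mask = (1 << width) - 1
        sign = 1 << (width - 1)
        ra, rb, rr = a & mask, b & mask, result & mask
        return {
            "n": int(bool(rr & sign)),
            "z": int(rr == 0),
            # overflow: operands share a sign and the sum's sign differs
            "v": int(bool(~(ra ^ rb) & (ra ^ rr) & sign)),
            "c": int(ra + rb + carry_in > mask),
        }

    return result, flags(32), flags(64)
```

For instance, adding 1 to 0xFFFFFFFF sets icc.c but not xcc.c, which is why a following ADDC sees a carry even though the 64-bit sum did not overflow.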
Programming Note — ADDC and ADDCcc read the 32-bit condition codes carry bit
(CCR.icc.c), not the 64-bit condition codes carry bit (CCR.xcc.c).








Compatibility Note — ADDC and ADDCcc were named ADDX and ADDXcc, respectively,
in SPARC-V8.





Exceptions 


None 





A.2 Alignment Instructions (VIS I) 


























Opcode opf Operation 
ALIGNADDRESS 0 0001 1000 Calculate address for misaligned data access 
ALIGNADDRESS_LITTLE 0 0001 1010 Calculate address for misaligned data access little-endian
FALIGNDATA 0 0100 1000 Perform data alignment for misaligned data 
Format (3) 
31 30 29 25 24 19 18 14 13 5 4 0 


Assembly Language Syntax 





alignaddr reg_rs1, reg_rs2, reg_rd
alignaddrl reg_rs1, reg_rs2, reg_rd
faligndata freg_rs1, freg_rs2, freg_rd

Description





ALIGNADDRESS adds two integer values, r[rs1] and r[rs2], and stores the result (with
the least significant three bits forced to zero) in the integer register r[rd]. The least
significant three bits of the result are stored in the GSR.align field.











ALIGNADDRESS_LITTLE is the same as ALIGNADDRESS except that the two's-complement
of the least significant 3 bits of the result is stored in GSR.align.











Note — ALIGNADDRESS_LITTLE generates the opposite-endian byte ordering for a subsequent
FALIGNDATA operation.








FALIGNDATA concatenates the two 64-bit floating-point registers specified by rs1 and rs2
to form a 128-bit (16-byte) intermediate value. The contents of the first source operand form
the more-significant 8 bytes of the intermediate value, and the contents of the second source
operand form the less-significant 8 bytes of the intermediate value. Bytes in the intermediate
value are numbered from most significant (byte 0) to least significant (byte 15). Eight bytes
are extracted from the intermediate value and stored in the 64-bit floating-point destination
register specified by rd. GSR.align specifies the number of the most significant byte to
extract (therefore, the least significant byte extracted from the intermediate value is
numbered GSR.align + 7).
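
As an illustration, the three alignment instructions can be modeled in a few lines of Python. This is a sketch under stated assumptions: the function names and the convention of passing GSR.align as an explicit argument are hypothetical, chosen only to make the data flow visible.

```python
MASK64 = (1 << 64) - 1

def alignaddr(rs1, rs2, little=False):
    """Model ALIGNADDRESS (or ALIGNADDRESS_LITTLE when little=True).

    Returns (r[rd], GSR.align).
    """
    total = (rs1 + rs2) & MASK64
    low3 = total & 0x7
    # The little-endian form stores the two's complement of the low 3 bits.
    align = (-low3) & 0x7 if little else low3
    return total & ~0x7 & MASK64, align

def faligndata(rs1, rs2, gsr_align):
    """Model FALIGNDATA: extract bytes gsr_align .. gsr_align+7 of rs1:rs2."""
    combined = ((rs1 & MASK64) << 64) | (rs2 & MASK64)
    # Byte 0 is the most significant byte of the 16-byte intermediate value.
    shift = (8 - gsr_align) * 8
    return (combined >> shift) & MASK64

def unaligned_load64(memory, address):
    """Load 8 big-endian bytes from any byte address using the two steps."""
    base, align = alignaddr(address, 0)
    hi = int.from_bytes(memory[base:base + 8], "big")
    lo = int.from_bytes(memory[base + 8:base + 16], "big")
    return faligndata(hi, lo, align)
```

Here `unaligned_load64` mirrors the alignaddr / two loads / faligndata sequence: the aligned base address selects two adjacent doublewords, and GSR.align selects which eight of the sixteen bytes survive.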


A byte-aligned 64-bit load can be performed as shown in CODE EXAMPLE A-1. 
CODE EXAMPLE A-1 Byte-Aligned 64-Bit Load 


alignaddr   Address, Offset, Address
ldd         [Address], %f0
ldd         [Address + 8], %f2
faligndata  %f0, %f2, %f4














Programming Note — For good performance, the result of FALIGNDATA should not be 
used as a source operand for a 32-bit FP or VIS instruction in the next three instruction 
groups. 





Exceptions 


fp_disabled
A.3 Three-Dimensional Array Addressing Instructions (VIS I)




















Opcode opf Operation 

ARRAY8 0 0001 0000 Convert 8-bit 3D address to blocked byte address

ARRAY16 0 0001 0010 Convert 16-bit 3D address to blocked byte address

ARRAY32 0 0001 0100 Convert 32-bit 3D address to blocked byte address
Format (3)
31 30 29 25 24 19 18 14 13 5 4 0

Assembly Language Syntax

array8 reg_rs1, reg_rs2, reg_rd
array16 reg_rs1, reg_rs2, reg_rd
array32 reg_rs1, reg_rs2, reg_rd








Description 


These instructions convert three-dimensional (3D) fixed-point addresses contained in
r[rs1] to a blocked-byte address; they store the result in r[rd]. Fixed-point addresses
typically are used for address interpolation for planar reformatting operations. Blocking is
performed at the 64-byte level to maximize L2-cache block reuse, and at the 64 KB level to
maximize TLB entry reuse, regardless of the orientation of the address interpolation. These
instructions specify an element size of 8 bits (ARRAY8), 16 bits (ARRAY16), or 32 bits
(ARRAY32). The second operand, r[rs2], specifies the power-of-2 size of the X and Y
dimensions of a 3D image array. The legal values for rs2 and their meanings are shown in
TABLE A-3. Illegal values produce undefined results in the destination register, r[rd].
FIGURE A-1 illustrates a three-dimensional array fixed-point address format.


TABLE A-3 Three-Dimensional r[rs2] Array X/Y Dimensions 















































r[rs2] value Number of Elements 
0 64 
1 128 
2 256 
3 512 
4 1024 
5 2048 
Field: Z integer | Z fraction | Y integer | Y fraction | X integer | X fraction
Bits:  63:55     | 54:44      | 43:33     | 32:22      | 21:11     | 10:0


FIGURE A-1 Three-Dimensional Array Fixed-Point Address Format 


The integer parts of X, Y, and Z are converted to the following blocked-address formats 
illustrated in FIGURE A-2, FIGURE A-3, and FIGURE A-4. 


[Figure: blocked-address bit layout with Upper, Middle, and Lower field groups]

FIGURE A-2 Three-Dimensional Array Blocked-Address Format (Array8)

[Figure: blocked-address bit layout with Upper, Middle, and Lower field groups]

FIGURE A-3 Three-Dimensional Array Blocked-Address Format (Array16)

[Figure: blocked-address bit layout with Upper, Middle, and Lower field groups]

FIGURE A-4 Three-Dimensional Array Blocked-Address Format (Array32)


The bits above Z upper are set to zero. The number of zeroes in the least significant bits is
determined by the element size. An element size of 8 bits has no zeroes, an element size of
16 bits has one zero, and an element size of 32 bits has two zeroes. Bits in X and Y above
the size specified by r[rs2] are ignored.
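
The first step of the conversion, pulling the integer parts out of the fixed-point address and discarding X and Y bits above the configured size, can be sketched as follows. This is a minimal Python illustration and deliberately stops short of the blocked-address interleaving shown in FIGURE A-2 through FIGURE A-4. The function name is hypothetical, and raising an error for an illegal r[rs2] is a modeling choice (the hardware result is simply undefined).

```python
def split_3d_address(rs1, rs2):
    """Return the (X, Y, Z) integer parts of a 3D fixed-point address.

    Field positions follow FIGURE A-1; rs2 follows TABLE A-3, where a
    value n selects 64 << n elements in the X and Y dimensions.
    """
    if not 0 <= rs2 <= 5:
        raise ValueError("illegal r[rs2] value; hardware result is undefined")
    x_int = (rs1 >> 11) & 0x7FF   # bits 21:11
    y_int = (rs1 >> 33) & 0x7FF   # bits 43:33
    z_int = (rs1 >> 55) & 0x1FF   # bits 63:55
    size_bits = 6 + rs2           # 64 elements need 6 bits, 2048 need 11
    size_mask = (1 << size_bits) - 1
    # Bits in X and Y above the size selected by r[rs2] are ignored.
    return x_int & size_mask, y_int & size_mask, z_int
```

With r[rs2] = 0 an X integer part of 70 wraps to 6, because only the low six bits are kept for a 64-element dimension.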


The code fragment in CODE EXAMPLE A-2 shows assembly of components along an
interpolated line at the rate of one component per clock.


CODE EXAMPLE A-2 Three-Dimensional Array Addressing Example 





add         Addr, DeltaAddr, Addr
array8      Addr, %g0, bAddr
ldda        [bAddr] ASI_FL8_PRIMARY, data
faligndata  data, accum, accum





Note — To maximize reuse of L2-cache and TLB data, software should block array
references of a large image to the 64 KB level. This means processing elements within a
32x64x64 block.


Exceptions 


None 
A.4 Block Load and Block Store (VIS I)









































Opcode | imm_asi | ASI Value | Operation
LDDFA, STDFA | ASI_BLK_AIUP | 70₁₆ | 64-byte block load/store from/to primary address space, privilege mode access only
LDDFA, STDFA | ASI_BLK_AIUS | 71₁₆ | 64-byte block load/store from/to secondary address space, privilege mode access only
LDDFA, STDFA | ASI_BLK_AIUPL | 78₁₆ | 64-byte block load/store from/to primary address space, little-endian, privilege mode access only
LDDFA, STDFA | ASI_BLK_AIUSL | 79₁₆ | 64-byte block load/store from/to secondary address space, little-endian, privilege mode access only
LDDFA, STDFA | ASI_BLK_P | F0₁₆ | 64-byte block load/store from/to primary address space
LDDFA, STDFA | ASI_BLK_S | F1₁₆ | 64-byte block load/store from/to secondary address space
LDDFA, STDFA | ASI_BLK_PL | F8₁₆ | 64-byte block load/store from/to primary address space, little-endian
LDDFA, STDFA | ASI_BLK_SL | F9₁₆ | 64-byte block load/store from/to secondary address space, little-endian
STDFA | ASI_BLK_COMMIT_P | E0₁₆ | 64-byte block commit store to primary address space
STDFA | ASI_BLK_COMMIT_S | E1₁₆ | 64-byte block commit store to secondary address space














Format (3) LDDFA
31 30 29 25 24 19 18 14 13 5 4 0

Format (3) STDFA
31 30 29 25 24 19 18 14 13 5 4 0

Assembly Language Syntax

ldda [reg_addr] imm_asi, freg_rd
ldda [reg_plus_imm] %asi, freg_rd
stda freg_rd, [reg_addr] imm_asi
stda freg_rd, [reg_plus_imm] %asi
Description 


A block load (BLD) or block store (BST) instruction uses an LDDFA or STDFA instruction 
combined with a block transfer ASI. Block transfer ASIs allow BLDs and BSTs to be 
performed accessing the same address space as normal loads and stores. Little-endian ASIs 
(those with an ‘L’ suffix) access data in little-endian format; otherwise, the access is assumed 
to be big-endian. Byte swapping is performed separately for each of the eight double- 
precision registers used by the instruction. Endianness does not matter if these instructions 
are only being used for a block copy operation. 


A BST with commit forces the data to be written to memory and invalidates copies in all
caches present. As a result, a BST with commit maintains coherency with the I-cache.¹ It
does not, however, flush instructions that have already been fetched into the pipeline before
executing the modified code. If a BST with commit is used to write modified instructions, a
FLUSH instruction must still be executed to guarantee that the instruction pipeline is flushed.


LDDFA with a block transfer ASI loads 64 bytes of data from a 64-byte aligned memory area
into the eight double-precision floating-point registers specified by rd. The lowest-addressed
eight bytes in memory are loaded into the lowest-numbered double-precision destination
register. An illegal_instruction exception occurs if the floating-point registers are not aligned
on an eight double-precision register boundary. The least significant six bits of the memory
address must be zero or a mem_address_not_aligned exception occurs.


STDFA with a block transfer ASI stores data from the eight double-precision floating-point
registers specified by rd to a 64-byte-aligned memory area. The lowest-addressed eight
bytes in memory are stored from the lowest-numbered double-precision rd. An
illegal_instruction exception occurs if the floating-point registers are not aligned on an eight
register boundary. The least significant 6 bits of the memory address must be zero or a
mem_address_not_aligned exception occurs.


1. All store instructions in the processor coherently update the instruction cache. In general SPARC-V9 implementations,
the store instructions (other than BST with Commit) do not maintain data coherency between instruction and data caches.


ASIs E0₁₆ and E1₁₆ are only used for BST with commit operations; they are not used for
BLD operations.
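
The alignment rules for block transfers can be summarized in a short Python model. This is a sketch under assumptions that go beyond the text: double-precision registers are identified by their even register number (0, 2, ..., 62), "an eight double-precision register boundary" is read as a register number that is a multiple of 16, and the register check is reported before the address check (the relative priority of the two exceptions is not stated here). The function names are hypothetical.

```python
def block_transfer_exception(address, rd):
    """Return the exception a BLD/BST would raise, or None if it is legal."""
    if rd % 16 != 0:                 # assumed: legal groups start at %f0, %f16, %f32, %f48
        return "illegal_instruction"
    if address & 0x3F:               # least significant six address bits must be zero
        return "mem_address_not_aligned"
    return None

def block_load(memory, address, rd):
    """Model a big-endian BLD: 64 bytes into eight double-precision registers."""
    fault = block_transfer_exception(address, rd)
    if fault is not None:
        raise ValueError(fault)
    registers = {}
    for i in range(8):
        start = address + 8 * i
        # Lowest-addressed eight bytes go to the lowest-numbered register.
        registers["%f" + str(rd + 2 * i)] = int.from_bytes(memory[start:start + 8], "big")
    return registers
```

A call such as `block_load(memory, 64, 16)` fills %f16 through %f30, with %f16 receiving the bytes at addresses 64 through 71.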





Programming Note — In the UltraSPARC IIIi processor, BLD does not offer a
performance advantage over normal loads. For high performance, the use of prefetch
instructions and 8-byte loads is recommended. BST and BST with commit can offer a
performance advantage and are used in high-performance UltraSPARC IIIi processor
libraries.








Programming Note — BLD does not provide register dependency interlocks, as ordinary 
load instructions do. 


Before BLD data can be referenced, a second BLD (to a different set of registers) or a 
MEMBAR #Sync must be performed. If a second BLD is used to synchronize against 
returning data, the processor will continue execution before all data has been returned. The 
programmer is then responsible for scheduling instructions so registers are only used when 
they become valid. 





To determine when data is valid, the programmer must count instruction groups containing 
floating-point (FP) operate instructions (not FP loads or stores). The lowest-numbered 
destination register of the first BLD may be referenced in the first instruction group 
following the second BLD, using an FP operate instruction only. 


The second-lowest-numbered destination register of the first BLD may be referenced in the 
second instruction group containing an FP operate instruction, and so on. 


If this block-load/block-load synchronization mechanism is used, the initial reference to the 
BLD data must be an FP operate instruction (not an FP store), and only instruction groups 
with FP operate instructions are counted when determining BLD data availability. 


If these rules are violated, data from before or after the BLD may be returned by a reference 
to any of the BLD’s destination registers. 


If a MEMBAR #Sync is used to synchronize on BLD data, there are no restrictions on data 
usage, although performance will be lower than if block-load/block-load synchronization is 
used. No other MEMBARs can be used to provide data synchronization for BLD. 








FP operate instructions can be issued in a single instruction group with FP stores. If block- 
load/block-load synchronization is used, FP operates and FP stores can be interlaced. This 
allows an FP operate instruction, such as FMOVD or FALIGNDATA, to reference the returning 
data before using the data in any FP store (normal store or BST). 


UltraSPARC IIIi Processor User's Manual • June 2003


The processor also continues execution, without register interlocks, before all the store data 
for BSTs are transferred from the register file. 





If store source registers are overwritten before the next BST or MEMBAR #Sync instruction, 
then the following rule must be observed: The first register can be overwritten in the same 
instruction group as the BST, the second register can be overwritten in the instruction group 
following the BST, and so on. If this rule is violated, the BST may use the old or the new 
(overwritten) data. 


When determining correctness for a code sample, note that the processor may interlock more 
than what is required above. For example, there may be partial register interlocks, such as on 
the lowest-number register. 


Code that does not meet the above constraints may appear to work on a particular processor.
However, to be portable across all processors similar to the UltraSPARC IIIi processor, all of
the above rules should be followed.


Rules 


Note — These instructions are used for transferring large blocks of data (more than
256 bytes), for example, in C library routines bcopy() and bfill(). They do not allocate
in the data cache or L2-cache on a miss. They update the L2-cache on a hit. One BLD and,
in the most extreme cases, up to fifteen (maximum) BSTs can be outstanding on the
interconnect at one time.


To simplify the implementation, BLD destination registers may or may not interlock like
ordinary load instructions. Before the BLD data is referenced, a second BLD (to a different
set of registers) or a MEMBAR #Sync must be performed. If a second BLD is used to
synchronize with returning data, then it continues execution before all data have been
returned. The lowest-number register being loaded can be referenced in the first instruction
group following the second BLD, the second lowest number register can be referenced in the
second group, and so on. If this rule is violated, data from before or after the load may be
returned.


Similarly, BST source data registers are not interlocked against completion of previous load 
instructions (even if a second BLD has been performed). The previous load data must be 
referenced by some other intervening instruction, or an intervening MEMBAR #Sync must be 
performed. If the programmer violates these rules, data from before or after the load may be 
used. The load continues execution before all of the store data have been transferred. If store 
data registers are overwritten before the next BST or MEMBAR #Sync instruction, then the 
following rule must be observed: The first register can be overwritten in the same instruction 
group as the BST, the second register can be overwritten in the instruction group following 
the BST, and so on. If this rule is violated, the store may store correct data or the overwritten 
data. 
There must be a MEMBAR #Sync or a trap following a BST before a DONE, RETRY, or WRPR
to PSTATE instruction is executed. If this rule is violated, instructions after the DONE,
RETRY, or WRPR to PSTATE may not see the effects of the updated PSTATE register.























BLD does not follow memory model ordering with respect to stores. In particular,
read-after-write and write-after-read hazards to overlapping addresses are not detected. The
side-effects bit (TTE.E) associated with the access is ignored. Some ordering considerations
are as follows:














- If ordering with respect to earlier stores is important (for example, a BLD that overlaps
previous stores), then there must be an intervening MEMBAR #StoreLoad or stronger
MEMBAR.

- If ordering with respect to later stores is important (for example, a BLD that overlaps a
subsequent store), then there must be an intervening MEMBAR #LoadStore or a
reference to the BLD data. This restriction does not apply when a trap is taken; therefore,
the trap handler does not have to worry about pending BLDs.

- If the BLD overlaps a previous or later store and there is no intervening MEMBAR, then the
trap or data referencing the BLD may return data from before or after the store.


BST does not follow memory model ordering with respect to loads, stores, or flushes. In 
particular, read-after-write, write-after-write, flush-after-write, and write-after-read hazards to 
overlapping addresses are not detected. The side-effects bit associated with the access is 
ignored. Some ordering considerations are as follows: 


- If ordering with respect to earlier or later loads or stores is important, then there must be
an intervening reference to the load data (for earlier loads) or an appropriate MEMBAR
instruction. This restriction does not apply when a trap is taken; therefore, the trap handler
does not have to worry about pending BSTs.

- If the BST overlaps a previous load and there is no intervening load data reference or
MEMBAR #StoreLoad instruction, then the load may return data from before or after the
store and the contents of the block are undefined.

- If the BST overlaps a later load and there is no intervening trap or
MEMBAR #LoadStore instruction, then the contents of the block are undefined.

- If the BST overlaps a later store or flush and there is no intervening trap or
MEMBAR #Sync instruction, then the contents of the block are undefined.

- If the ordering of two successive BST instructions (overlapping or not) is required, then a
MEMBAR #Sync must occur between the BST instructions.























Block operations do not obey the ordering restrictions of the currently selected processor
memory model (TSO, PSO, RMO). Block operations always execute under an RMO memory
ordering model. Explicit MEMBAR instructions are required to order block operations among
themselves or with respect to normal memory operations. In addition, block operations do
not conform to dependence order on the issuing processor; that is, no read-after-write,
write-after-read, or write-after-write checking occurs between block operations. Explicit
MEMBAR #Sync instructions are required to enforce dependence ordering between block
operations that reference the same address.
Typically, BLD and BST will be used in loops where software can ensure that the data being
loaded and the data being stored do not overlap. The loop will be preceded and followed by
the appropriate MEMBARs to ensure that there are no hazards with loads and stores outside the
loops. CODE EXAMPLE A-3 demonstrates the loop.


CODE EXAMPLE A-3 Byte-Aligned Block Copy Inner Loop with Block Load/Block Store


Note that the loop must be unrolled two times to achieve maximum performance. All FP registers
are double-precision. Eight versions of this loop are needed to handle all the cases of doubleword
misalignment between the source and destination.


loop:
        faligndata  %f0, %f2, %f34
        faligndata  %f2, %f4, %f36
        faligndata  %f4, %f6, %f38
        faligndata  %f6, %f8, %f40
        faligndata  %f8, %f10, %f42
        faligndata  %f10, %f12, %f44
        faligndata  %f12, %f14, %f46
        addcc       %l0, -1, %l0
        bg,pt       l1
        fmovd       %f14, %f48
l1:
        ldda        [regaddr] ASI_BLK_P, %f0
        stda        %f32, [regaddr] ASI_BLK_P
        faligndata  %f48, %f16, %f32
        faligndata  %f16, %f18, %f34
        faligndata  %f18, %f20, %f36
        faligndata  %f20, %f22, %f38
        faligndata  %f22, %f24, %f40
        faligndata  %f24, %f26, %f42
        faligndata  %f26, %f28, %f44
        faligndata  %f28, %f30, %f46
        addcc       %l0, -1, %l0
        be,pnt      done
        fmovd       %f30, %f48
        ldda        [regaddr] ASI_BLK_P, %f16
        stda        %f32, [regaddr] ASI_BLK_P
        ba          loop
        faligndata  %f48, %f0, %f32
done:   ! (end of loop processing)
Bcopy Code 


To achieve the highest Bcopy bandwidths, use prefetch instructions and floating-point loads
instead of BLD instructions. Using prefetch instructions to bring memory data into the
prefetch cache hides all of the latency to memory. This allows a Bcopy loop to run at
maximum bandwidth. CODE EXAMPLE A-4 shows how to modify the standard UltraSPARC I
processor bcopy() loop to use PREFETCH and floating-point load instructions instead of
BLDs.


CODE EXAMPLE A-4 High-Performance bcopy() Preamble Code


preamble:
        prefetch    [srcaddr],1
        prefetch    [srcaddr+0x40],1
        prefetch    [srcaddr+0x80],1
        prefetch    [srcaddr+0xc0],1
        lddf        [srcaddr],%f0
        prefetch    [srcaddr+0x100],1
        lddf        [srcaddr+0x8],%f2
        lddf        [srcaddr+0x10],%f4
        faligndata  %f0,%f2,%f32
        lddf        [srcaddr+0x18],%f6
        faligndata  %f2,%f4,%f34
        lddf        [srcaddr+0x20],%f8
        faligndata  %f4,%f6,%f36
        lddf        [srcaddr+0x28],%f10
        faligndata  %f6,%f8,%f38
        lddf        [srcaddr+0x30],%f12
        faligndata  %f8,%f10,%f40
        lddf        [srcaddr+0x38],%f14
        faligndata  %f10,%f12,%f42
        lddf        [srcaddr+0x40],%f16
        subcc       count,0x40,count
        bpe         <exit>
        add         srcaddr,0x40,srcaddr

loop:
        fmovd       %f16,%f0
        lddf        [srcaddr+0x8],%f2
        faligndata  %f12,%f14,%f44
        lddf        [srcaddr+0x10],%f4
        faligndata  %f14,%f0,%f46
        stda        %f32,[dstaddr] ASI_BLK_P
        lddf        [srcaddr+0x18],%f6
        faligndata  %f0,%f2,%f32
        lddf        [srcaddr+0x20],%f8
        faligndata  %f2,%f4,%f34
        lddf        [srcaddr+0x28],%f10
        faligndata  %f4,%f6,%f36
        lddf        [srcaddr+0x30],%f12
        faligndata  %f6,%f8,%f38
        lddf        [srcaddr+0x38],%f14
        faligndata  %f8,%f10,%f40
        lddf        [srcaddr+0x40],%f16
        prefetch    [srcaddr+0x100],1
        faligndata  %f10,%f12,%f42
        subcc       count,0x40,count
        add         dstaddr,0x40,dstaddr
        bpg         loop
        add         srcaddr,0x40,srcaddr


Exceptions


fp_disabled
PA_watchpoint (recognized on only the first 8 bytes of a transfer)
VA_watchpoint (recognized on only the first 8 bytes of a transfer)
illegal_instruction (misaligned rd)
mem_address_not_aligned
data_access_exception
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection
A.5 Byte Mask and Shuffle Instructions (VIS II)


Opcode opf Operation
BMASK 0 0001 1001 Set the GSR.mask field in preparation for a following BSHUFFLE instruction
BSHUFFLE 0 0100 1100 Permute bytes as specified by GSR.mask

Format (3)
31 30 29 25 24 19 18 14 13 5 4 0

Assembly Language Syntax

bmask reg_rs1, reg_rs2, reg_rd
bshuffle freg_rs1, freg_rs2, freg_rd











Description 


BMASK adds two integer registers, r[rs1] and r[rs2], and stores the result in the integer
register r[rd]. The least significant 32 bits of the result are stored in the GSR.mask field.


BSHUFFLE concatenates the two 64-bit floating-point registers specified by rs1 (more-
significant half) and rs2 (less-significant half) to form a 16-byte value. Bytes in the
concatenated value are numbered from most significant to least significant, with the most
significant byte being byte 0. BSHUFFLE extracts 8 of the 16 bytes and stores the result in
the 64-bit floating-point register specified by rd. Bytes in the rd register are also numbered
from most to least significant, with the most significant being byte 0. The following table
indicates which source byte is extracted from the concatenated value for each byte in rd.








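Destination Byte (in r[rd]) | Source Byte
0 (Most significant) | (r[rs1] || r[rs2])[GSR.mask<31:28>]
1 | (r[rs1] || r[rs2])[GSR.mask<27:24>]
2 | (r[rs1] || r[rs2])[GSR.mask<23:20>]
3 | (r[rs1] || r[rs2])[GSR.mask<19:16>]
4 | (r[rs1] || r[rs2])[GSR.mask<15:12>]
5 | (r[rs1] || r[rs2])[GSR.mask<11:8>]
6 | (r[rs1] || r[rs2])[GSR.mask<7:4>]
7 (Least significant) | (r[rs1] || r[rs2])[GSR.mask<3:0>]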
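
The byte selection in the table can be expressed compactly in Python. The sketch below is illustrative only: the function names are hypothetical and GSR.mask is passed as an ordinary argument rather than held in a status register.

```python
MASK64 = (1 << 64) - 1

def bmask(rs1, rs2):
    """Model BMASK: return (r[rd], GSR.mask)."""
    total = (rs1 + rs2) & MASK64
    return total, total & 0xFFFFFFFF

def bshuffle(rs1, rs2, gsr_mask):
    """Model BSHUFFLE: pick 8 of the 16 source bytes, one per mask nibble."""
    # Byte 0 is the most significant byte of rs1; byte 15 is the least of rs2.
    source = (rs1 & MASK64).to_bytes(8, "big") + (rs2 & MASK64).to_bytes(8, "big")
    # Destination byte i is selected by mask bits <31-4i : 28-4i>.
    picked = bytes(source[(gsr_mask >> (28 - 4 * i)) & 0xF] for i in range(8))
    return int.from_bytes(picked, "big")
```

A mask of 0x01234567 copies rs1 unchanged, 0x89ABCDEF copies rs2, and 0x76543210 reverses the byte order of rs1, which is a convenient way to sanity-check a mask before using it.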
Note — The BMASK instruction uses the MS pipeline; therefore, it cannot be grouped with a
store, non-prefetchable load, or a special instruction. The integer rd register result is
available after a two-cycle latency. A younger BMASK can be grouped with an older
BSHUFFLE (BMASK is “break-after”).





Results have a four-cycle latency to other dependent instructions executed in FGA and FGM 
pipelines. The FGA pipeline is used to execute BSHUFFLE. The GSR mask must be set at or 
before the instruction group previous to the BSHUFFLE (GSR.mask dependency). 
BSHUFFLE is fully pipelined (one per cycle). 




















Exceptions 


fp_disabled





A.6 Branch on Integer Register with Prediction (BPr)


Opcode | rcond | Operation | Register Contents Test
— | 000 | Reserved | —
BRZ | 001 | Branch on Register Zero | r[rs1] = 0
BRLEZ | 010 | Branch on Register Less Than or Equal to Zero | r[rs1] ≤ 0
BRLZ | 011 | Branch on Register Less Than Zero | r[rs1] < 0
— | 100 | Reserved | —
BRNZ | 101 | Branch on Register Not Zero | r[rs1] ≠ 0
BRGZ | 110 | Branch on Register Greater Than Zero | r[rs1] > 0
BRGEZ | 111 | Branch on Register Greater Than or Equal to Zero | r[rs1] ≥ 0





























Format (2)
31 30 29 28 27 25 24 22 21 20 19 18 14 13 0

Assembly Language Syntax

brz{,a}{,pt|,pn} reg_rs1, label
brlez{,a}{,pt|,pn} reg_rs1, label
brlz{,a}{,pt|,pn} reg_rs1, label
brnz{,a}{,pt|,pn} reg_rs1, label
brgz{,a}{,pt|,pn} reg_rs1, label
brgez{,a}{,pt|,pn} reg_rs1, label





Programming Note — To set the annul bit for BPr instructions, append “,a” to the
opcode mnemonic. For example, use “brz,a %i3, label.” In the preceding table, braces
signify that the “,a” is optional. To set the branch prediction bit p, append either “,pt” for
predict taken or “,pn” for predict not taken to the opcode mnemonic. If neither “,pt” nor
“,pn” is specified, the assembler shall default to “,pt.”





Programming Note — Both BP and BR represent branch on integer register with 
prediction. They are, in fact, the same instruction. 





Description 


These instructions branch based on the contents of r[rs1]. They treat the register contents 
as a signed integer value. 


A BPr instruction examines all 64 bits of r[rs1] according to the rcond field of the
instruction, producing either a TRUE or FALSE result. If TRUE, the branch is taken; that is,
the instruction causes a PC-relative, delayed control transfer to the address
“PC + (4 * sign_ext(d16hi || d16lo)).” If FALSE, the branch is not taken.
































If the branch is taken, the delay instruction is always executed, regardless of the value of the 
annul bit. If the branch is not taken and the annul bit (a) is one, the delay instruction is 
annulled (not executed). 


The predict bit (p) gives the hardware a hint about whether the branch is expected to be 
taken. A one in the p bit indicates that the branch is expected to be taken; a zero indicates 
that the branch is expected not to be taken. 
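The branch decision, target computation, and annul behaviour described above can be summarised in a small behavioural sketch. This is an illustrative model only, not the hardware implementation: the function and table names (`bpr`, `RCOND_TESTS`, `sign_ext`) are hypothetical, and the BRLZ encoding (rcond = 011) is taken from the SPARC V9 encoding of this instruction.

```python
# Hypothetical behavioural model of BPr; names are illustrative, not from the manual.
MASK64 = (1 << 64) - 1


def sign_ext(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


# rcond -> test applied to r[rs1] treated as a signed 64-bit integer.
RCOND_TESTS = {
    0b001: lambda r: r == 0,   # BRZ
    0b010: lambda r: r <= 0,   # BRLEZ
    0b011: lambda r: r < 0,    # BRLZ
    0b101: lambda r: r != 0,   # BRNZ
    0b110: lambda r: r > 0,    # BRGZ
    0b111: lambda r: r >= 0,   # BRGEZ
}


def bpr(pc, rcond, reg, d16hi, d16lo, annul):
    """Return (taken, target or None, delay_instruction_executed)."""
    if rcond not in RCOND_TESTS:
        raise ValueError("illegal_instruction: reserved rcond")
    taken = RCOND_TESTS[rcond](sign_ext(reg, 64))
    # d16hi is 2 bits, d16lo is 14 bits; together they form a 16-bit word displacement.
    disp16 = ((d16hi & 0x3) << 14) | (d16lo & 0x3FFF)
    target = (pc + 4 * sign_ext(disp16, 16)) & MASK64
    # The delay instruction is annulled only when the branch is not taken and a = 1.
    delay_executed = taken or not annul
    return taken, (target if taken else None), delay_executed
```

For example, `bpr(0x1000, 0b001, 0, 0, 4, False)` models a taken BRZ whose target is 16 bytes past the branch.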





Implementation Note — The UltraSPARC IIIi processor does not implement this 
instruction by tagging each register value. The UltraSPARC IIIi processor looks at the full 
64-bit register to determine a negative or zero.

Exceptions


illegal_instruction (if rcond = 000₂ or 100₂)





A.7 Branch on Floating-Point Condition Codes 
with Prediction (FBPfcc) 





















































Opcode   cond   Operation                                  fcc Test
FBPA     1000   Branch Always                              1
FBPN     0000   Branch Never                               0
FBPU     0111   Branch on Unordered                        U
FBPG     0110   Branch on Greater                          G
FBPUG    0101   Branch on Unordered or Greater             G or U
FBPL     0100   Branch on Less                             L
FBPUL    0011   Branch on Unordered or Less                L or U
FBPLG    0010   Branch on Less or Greater                  L or G
FBPNE    0001   Branch on Not Equal                        L or G or U
FBPE     1001   Branch on Equal                            E
FBPUE    1010   Branch on Unordered or Equal               E or U
FBPGE    1011   Branch on Greater or Equal                 E or G
FBPUGE   1100   Branch on Unordered or Greater or Equal    E or G or U
FBPLE    1101   Branch on Less or Equal                    E or L
FBPULE   1110   Branch on Unordered or Less or Equal       E or L or U
FBPO     1111   Branch on Ordered                          E or L or G
Format (2)
00 | a | cond | 101 | cc1 | cc0 | p | disp19
31 30 29 28 25 24 22 21 20 19 18 0

cc1 cc0   Condition Code
00        fcc0
01        fcc1
10        fcc2
11        fcc3

































































Assembly Language Syntax

fba{,a}{,pt|,pn}    %fccn, label
fbn{,a}{,pt|,pn}    %fccn, label
fbu{,a}{,pt|,pn}    %fccn, label
fbg{,a}{,pt|,pn}    %fccn, label
fbug{,a}{,pt|,pn}   %fccn, label
fbl{,a}{,pt|,pn}    %fccn, label
fbul{,a}{,pt|,pn}   %fccn, label
fblg{,a}{,pt|,pn}   %fccn, label
fbne{,a}{,pt|,pn}   %fccn, label   (synonym: fbnz)
fbe{,a}{,pt|,pn}    %fccn, label   (synonym: fbz)
fbue{,a}{,pt|,pn}   %fccn, label
fbge{,a}{,pt|,pn}   %fccn, label
fbuge{,a}{,pt|,pn}  %fccn, label
fble{,a}{,pt|,pn}   %fccn, label
fbule{,a}{,pt|,pn}  %fccn, label
fbo{,a}{,pt|,pn}    %fccn, label





Programming Note — To set the annul bit for FBPfcc instructions, append “,a” to the
opcode mnemonic. For example, use “fbl,a %fcc3,label.” In the preceding table,
braces signify that the “,a” is optional. To set the branch prediction bit, append either
“,pt” (for predict taken) or “,pn” (for predict not taken) to the opcode mnemonic. If
neither “,pt” nor “,pn” is specified, the assembler shall default to “,pt.” To select the
appropriate floating-point condition code, include “%fcc0,” “%fcc1,” “%fcc2,” or
“%fcc3” before the label.


Description 


Unconditional branches and fcc-conditional branches are described below.

- Unconditional branches (FBPA, FBPN) — If its annul field is zero, an FBPN
(Floating-Point Branch Never with Prediction) instruction acts like a NOP. If the Branch
Never annul field is zero, the following (delay) instruction is executed; if the annul field is
one, the following instruction is annulled (not executed). In no case does an FBPN cause
a transfer of control to take place.


FBPA (Floating-Point Branch Always with Prediction) causes an unconditional PC-relative,
delayed control transfer to the address “PC + (4 × sign_ext(disp19)).” If
the annul field of the branch instruction is one, the delay instruction is annulled (not
executed). If the annul field is zero, the delay instruction is executed.


- fcc-conditional branches — Conditional FBPfcc instructions (except FBPA and FBPN)
evaluate one of the four floating-point condition codes (fcc0, fcc1, fcc2, fcc3) as
selected by cc0 and cc1, according to the cond field of the instruction, producing either
a TRUE or FALSE result. If TRUE, the branch is taken; that is, the instruction causes a
PC-relative, delayed control transfer to the address “PC + (4 × sign_ext(disp19)).” If
FALSE, the branch is not taken.




















If a conditional branch is taken, the delay instruction is always executed, regardless of the 
value of the annul field. If a conditional branch is not taken and the annul field (a) is one, 
the delay instruction is annulled (not executed). 


Note — The annul bit has a different effect on conditional branches than it does on 
unconditional branches. 





The predict bit (p) gives the hardware a hint about whether the branch is expected to be 
taken. A one in the p bit indicates that the branch is expected to be taken. A zero indicates 
that the branch is expected not to be taken. 
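The fcc tests and the differing annul behaviour of conditional and unconditional branches can be sketched as follows. This is an illustrative model under stated assumptions: the names `FCC_TESTS` and `fbpfcc` are hypothetical, and the numeric fcc values (0 = E, 1 = L, 2 = G, 3 = U) follow the condition-code encoding given for the floating-point compare instructions.

```python
# Hypothetical model of FBPfcc evaluation; names are illustrative only.
E, L, G, U = 0, 1, 2, 3  # fcc values: equal, less, greater, unordered

# cond field -> set of fcc values for which the branch is taken.
FCC_TESTS = {
    0b1000: {E, L, G, U},  # FBPA
    0b0000: set(),         # FBPN
    0b0111: {U},           # FBPU
    0b0110: {G},           # FBPG
    0b0101: {G, U},        # FBPUG
    0b0100: {L},           # FBPL
    0b0011: {L, U},        # FBPUL
    0b0010: {L, G},        # FBPLG
    0b0001: {L, G, U},     # FBPNE
    0b1001: {E},           # FBPE
    0b1010: {E, U},        # FBPUE
    0b1011: {E, G},        # FBPGE
    0b1100: {E, G, U},     # FBPUGE
    0b1101: {E, L},        # FBPLE
    0b1110: {E, L, U},     # FBPULE
    0b1111: {E, L, G},     # FBPO
}


def fbpfcc(cond, fcc, annul):
    """Return (taken, delay_instruction_executed) for one FBPfcc instruction."""
    taken = fcc in FCC_TESTS[cond]
    if cond in (0b1000, 0b0000):
        # Unconditional (FBPA, FBPN): a = 1 always annuls the delay instruction.
        delay_executed = not annul
    else:
        # Conditional: the delay instruction is annulled only if not taken and a = 1.
        delay_executed = taken or not annul
    return taken, delay_executed
```

Note how `fbpfcc(0b1000, E, True)` (FBPA with a = 1) annuls the delay instruction even though the branch is taken, whereas a taken conditional branch never does.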

















If FPRS.FEF = 0 or PSTATE.PEF = 0, or if an FPU is not present, an FBPfcc instruction
is not executed and instead, an fp_disabled exception is generated.


Compatibility Note — Unlike SPARC-V8, SPARC-V9 does not require an instruction 
between a floating-point compare operation and a floating-point branch (FBfcc, FBPfcc). 


Exceptions 


fp_disabled



A.8 Branch on Integer Condition Codes with
Prediction (BPcc)














Opcode   cond   Operation                                                  icc Test
BPA      1000   Branch Always                                              1
BPN      0000   Branch Never                                               0
BPNE     1001   Branch on Not Equal                                        not Z
BPE      0001   Branch on Equal                                            Z
BPG      1010   Branch on Greater                                          not (Z or (N xor V))
BPLE     0010   Branch on Less or Equal                                    Z or (N xor V)
BPGE     1011   Branch on Greater or Equal                                 not (N xor V)
BPL      0011   Branch on Less                                             N xor V
BPGU     1100   Branch on Greater Unsigned                                 not (C or Z)
BPLEU    0100   Branch on Less or Equal Unsigned                           C or Z
BPCC     1101   Branch on Carry Clear (Greater Than or Equal, Unsigned)    not C
BPCS     0101   Branch on Carry Set (Less Than, Unsigned)                  C
BPPOS    1110   Branch on Positive                                         not N
BPNEG    0110   Branch on Negative                                         N
BPVC     1111   Branch on Overflow Clear                                   not V
BPVS     0111   Branch on Overflow Set                                     V
Format (2)
00 | a | cond | 001 | cc1 | cc0 | p | disp19
31 30 29 28 25 24 22 21 20 19 18 0

cc1 cc0   Condition Code
00        icc
01        —
10        xcc
11        —

Assembly Language Syntax





ba{,a}{,pt|,pn}    i_or_x_cc, label
bn{,a}{,pt|,pn}    i_or_x_cc, label   (or: iprefetch label)
bne{,a}{,pt|,pn}   i_or_x_cc, label   (synonym: bnz)
be{,a}{,pt|,pn}    i_or_x_cc, label   (synonym: bz)
bg{,a}{,pt|,pn}    i_or_x_cc, label
ble{,a}{,pt|,pn}   i_or_x_cc, label
bge{,a}{,pt|,pn}   i_or_x_cc, label
bl{,a}{,pt|,pn}    i_or_x_cc, label
bgu{,a}{,pt|,pn}   i_or_x_cc, label
bleu{,a}{,pt|,pn}  i_or_x_cc, label
bcc{,a}{,pt|,pn}   i_or_x_cc, label
bcs{,a}{,pt|,pn}   i_or_x_cc, label
bpos{,a}{,pt|,pn}  i_or_x_cc, label
bneg{,a}{,pt|,pn}  i_or_x_cc, label
bvc{,a}{,pt|,pn}   i_or_x_cc, label
bvs{,a}{,pt|,pn}   i_or_x_cc, label











Programming Note — To set the annul bit for BPcc instructions, append “,a” to the
opcode mnemonic. For example, use “bgu,a %icc, label.” Braces in the preceding table
signify that the “,a” is optional. To set the branch prediction bit, append to an opcode
mnemonic either “,pt” for predict taken or “,pn” for predict not taken. If neither “,pt”
nor “,pn” is specified, the assembler shall default to “,pt.” To select the appropriate integer
condition code, include “%icc” or “%xcc” before the label.


Description 


Unconditional branches and conditional branches are described below:


- Unconditional branches (BPA, BPN) — A BPN (Branch Never with Prediction)
instruction for this branch type (op2 = 1) is used in SPARC-V9 as an instruction prefetch;
that is, the effective address (PC + (4 × sign_ext(disp19))) specifies an address of
an instruction that is expected to be executed soon. If the Branch Never annul field is one,
then the following (delay) instruction is annulled (not executed). If the annul field is zero,
then the following instruction is executed. In no case does a Branch Never cause a transfer
of control to take place.

BPA (Branch Always with Prediction) causes an unconditional PC-relative, delayed
control transfer to the address “PC + (4 × sign_ext(disp19)).” If the annul field of
the branch instruction is one, then the delay instruction is annulled (not executed). If the
annul field is zero, then the delay instruction is executed.


- Conditional branches — Conditional BPcc instructions (except BPA and BPN) evaluate
one of the two integer condition codes (icc or xcc), as selected by cc0 and cc1,
according to the cond field of the instruction, producing either a TRUE or FALSE result.
If TRUE, the branch is taken; that is, the instruction causes a PC-relative, delayed control
transfer to the address “PC + (4 × sign_ext(disp19)).” If FALSE, the branch is not
taken.


























If a conditional branch is taken, the delay instruction is always executed regardless of the 
value of the annul field. If a conditional branch is not taken and the annul field (a) is one, 
the delay instruction is annulled (not executed). 





Note — The annul bit has a different effect for conditional branches than it does for 
unconditional branches. 





The predict bit (p) is used to give the hardware a hint about whether the branch is 
expected to be taken. A one in the p bit indicates that the branch is expected to be taken; 
a zero indicates that the branch is expected not to be taken. 
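The integer condition tests can be modelled compactly. The sketch below is illustrative only: `icc_test` and `select_cc` are hypothetical names, and the observation that the high bit of cond inverts the test selected by the low three bits is a property of the encoding as listed in the opcode table (with BPLE = 0010 and BPLEU = 0100 as in SPARC V9), not a statement from this manual.

```python
# Hypothetical model of BPcc condition evaluation; names are illustrative only.
def icc_test(cond, n, z, v, c):
    """Evaluate a 4-bit BPcc cond field against N, Z, V, C flags (booleans)."""
    base = {
        0b000: False,            # BPN    (inverse: BPA)
        0b001: z,                # BPE    (inverse: BPNE)
        0b010: z or (n != v),    # BPLE   (inverse: BPG)
        0b011: n != v,           # BPL    (inverse: BPGE)
        0b100: c or z,           # BPLEU  (inverse: BPGU)
        0b101: c,                # BPCS   (inverse: BPCC)
        0b110: n,                # BPNEG  (inverse: BPPOS)
        0b111: v,                # BPVS   (inverse: BPVC)
    }[cond & 0b0111]
    # Bit 3 of cond selects the inverted test.
    return (not base) if cond & 0b1000 else bool(base)


def select_cc(cc1, cc0):
    """Map the cc1/cc0 fields to the condition code set they select."""
    if (cc1, cc0) == (0, 0):
        return "icc"
    if (cc1, cc0) == (1, 0):
        return "xcc"
    raise ValueError("illegal_instruction: reserved cc1/cc0 combination")
```

For instance, `icc_test(0b1010, n=False, z=False, v=False, c=False)` evaluates BPG and returns True, because neither Z nor (N xor V) is set.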


Exceptions 





illegal_instruction (cc1 || cc0 = 01₂ or 11₂)














A.9 Call and Link 


Opcode   op   Operation
CALL     01   Call and Link


Format (1)
01 | disp30
31 30 29 0

Assembly Language Syntax








call label 


Description 


The CALL instruction causes an unconditional, delayed, PC-relative control transfer to
address PC + (4 × sign_ext(disp30)). Since the word displacement (disp30) field is
30 bits wide, the target address lies within a range of -2^31 to +2^31 - 4 bytes. The PC-relative
displacement is formed by sign-extending the 30-bit word displacement field to 62 bits and
appending two low-order zeroes to obtain a 64-bit byte displacement.


The CALL instruction also writes the value of PC, which contains the address of the CALL, 
into r[15] (out register 7). 
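The target-address arithmetic can be checked with a short sketch. The function name `call` and its return convention are hypothetical; only the displacement arithmetic and the write of PC to r[15] come from the description above.

```python
# Hypothetical model of CALL; names are illustrative only.
MASK64 = (1 << 64) - 1


def call(pc, disp30):
    """Return (target_address, value_written_to_r15) for a CALL at address pc."""
    disp30 &= (1 << 30) - 1
    if disp30 >> 29:                 # sign bit of the 30-bit word displacement
        disp30 -= 1 << 30
    target = (pc + 4 * disp30) & MASK64   # word displacement scaled to bytes
    return target, pc                # r[15] receives the address of the CALL itself
```

With `pc = 0x10000`, a displacement of 0x10 words gives a target of 0x10040, and the all-ones displacement 0x3FFFFFFF (that is, -1 word) gives 0xFFFC.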


Exceptions 


None 





A.10 Compare and Swap























Opcode        op3       Operation
CASA^P_ASI    11 1100   Compare and Swap Word from Alternate Space
CASXA^P_ASI   11 1110   Compare and Swap Extended from Alternate Space

Format (3)
11 | rd | op3 | rs1 | i=0 | imm_asi | rs2
11 | rd | op3 | rs1 | i=1 | —       | rs2
31 30 29 25 24 19 18 14 13 12 5 4 0

Assembly Language Syntax

casa   [reg_rs1] imm_asi, reg_rs2, reg_rd
casa   [reg_rs1] %asi, reg_rs2, reg_rd
casxa  [reg_rs1] imm_asi, reg_rs2, reg_rd
casxa  [reg_rs1] %asi, reg_rs2, reg_rd

Description 


Concurrent processes use these instructions for synchronization and memory updates. Uses 
of compare-and-swap include spin-lock operations, updates of shared counters, and updates 
of linked-list pointers. The last two can use wait-free (non-locking) protocols. 


The CASXA instruction compares the value in register r[rs2] with the doubleword in
memory pointed to by the doubleword address in r[rs1]. If the values are equal, the value
in r[rd] is swapped with the doubleword pointed to by the doubleword address in
r[rs1]. If the values are not equal, the contents of the doubleword pointed to by r[rs1]
replace the value in r[rd], but the memory location remains unchanged.


The CASA instruction compares the low-order 32 bits of register r[rs2] with a word in
memory pointed to by the word address in r[rs1]. If the values are equal, then the
low-order 32 bits of register r[rd] are swapped with the contents of the memory word pointed
to by the address in r[rs1] and the high-order 32 bits of register r[rd] are set to zero. If
the values are not equal, the memory location remains unchanged, but the zero-extended
contents of the memory word pointed to by r[rs1] replace the low-order 32 bits of r[rd]
and the high-order 32 bits of register r[rd] are set to zero.


A compare-and-swap instruction comprises three operations: load, compare, and swap. The 
overall instruction is atomic; that is, no intervening interrupts or deferred traps are 
recognized by the processor and no intervening update resulting from a compare-and-swap, 
swap, load, load-store unsigned byte, or store instruction to the doubleword containing the 
addressed location, or any portion of it, is performed by the memory system. 
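The register and memory effects of CASA and CASXA can be illustrated with a sequential sketch. It does not model atomicity, ASIs, or traps other than alignment; `memory` is a hypothetical dictionary mapping aligned addresses to values, and the function names are illustrative. Each function returns the new contents of r[rd].

```python
# Hypothetical sequential model of CASA/CASXA; atomicity is NOT modelled here.
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def casxa(memory, addr, rs2, rd):
    """Compare-and-swap a doubleword; returns the value loaded into r[rd]."""
    if addr % 8:
        raise ValueError("mem_address_not_aligned")
    old = memory[addr] & MASK64
    if old == (rs2 & MASK64):
        memory[addr] = rd & MASK64   # swap: memory receives the old r[rd]
    return old                       # r[rd] always receives the old memory value


def casa(memory, addr, rs2, rd):
    """Compare-and-swap a word; the result is zero-extended to 64 bits."""
    if addr % 4:
        raise ValueError("mem_address_not_aligned")
    old = memory[addr] & MASK32
    if old == (rs2 & MASK32):        # only the low-order 32 bits are compared
        memory[addr] = rd & MASK32
    return old                       # high-order 32 bits of r[rd] become zero
```

In both cases r[rd] ends up holding the previous memory contents, so software can detect success by comparing the returned value with r[rs2].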


A compare-and-swap operation does not imply any memory barrier semantics. When
compare-and-swap is used for synchronization, the same consideration should be given to 
memory barriers as if a load, store, or swap instruction were used. 


A compare-and-swap operation behaves as if it performs a store, either of a new value from 
r[rd] or of the previous value in memory. The addressed location must be writable, even if
the values in memory and r[rs2] are not equal. 


If i = 0, the address space of the memory location is specified in the imm_asi field; if
i = 1, the address space is specified in the ASI register.

A mem_address_not_aligned exception is generated if the address in r[rs1] is not properly
aligned. CASXA and CASA cause a privileged_action exception if PSTATE.PRIV = 0 and
bit 7 of the ASI is zero.





The coherence and atomicity of memory operations between processors and I/O DMA 
memory accesses is maintained for cacheable memory space. 





Programming Note — Compare and Swap (CAS) and Compare and Swap Extended 
(CASX) synthetic instructions are available for “big-endian” memory accesses. Compare and 
Swap Little (CASL) and Compare and Swap Extended Little (CASXL) synthetic instructions 
are available for “little-endian” memory accesses. 





The compare-and-swap instructions do not affect the condition codes. 


Exceptions 


privileged_action
mem_address_not_aligned
data_access_exception
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection
PA_watchpoint
VA_watchpoint



A.11 DONE and RETRY


























Opcode    fcn    Operation
DONE^P    0      Return from Trap (skip trapped instruction)
RETRY^P   1      Return from Trap (retry trapped instruction)
—         2-31   Reserved

Format (3)
10 | fcn | op3 | —
31 30 29 25 24 19 18 0


Assembly Language Syntax 


done 














retry 





Description 











The DONE and RETRY instructions restore the saved state from TSTATE (CWP, ASI, CCR, 
and PSTATE), set PC and nPC, and decrement TL. 

















The RETRY instruction resumes execution with the trapped instruction by setting
PC ← TPC[TL] (the saved value of PC on trap) and nPC ← TNPC[TL] (the saved value of
nPC on trap).


The DONE instruction skips the trapped instruction by setting PC ← TNPC[TL] and
nPC ← TNPC[TL] + 4.
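The difference between the two instructions is only in how PC and nPC are reloaded, which a small sketch makes explicit. This is an illustrative model: `trap_return` is a hypothetical name, `tpc` and `tnpc` are assumed to be mappings indexed by trap level, and the restoration of CWP, ASI, CCR, and PSTATE from TSTATE is deliberately omitted.

```python
# Hypothetical model of the PC/nPC/TL effects of DONE and RETRY.
def trap_return(fcn, tl, tpc, tnpc):
    """Return (pc, npc, new_tl); fcn = 0 models DONE, fcn = 1 models RETRY."""
    if tl == 0 or fcn not in (0, 1):
        raise ValueError("illegal_instruction")
    if fcn == 0:
        # DONE: skip the trapped instruction.
        pc, npc = tnpc[tl], tnpc[tl] + 4
    else:
        # RETRY: re-execute the trapped instruction.
        pc, npc = tpc[tl], tnpc[tl]
    return pc, npc, tl - 1
```

For a trap taken at TL = 1 with TPC[1] = 0x4000 and TNPC[1] = 0x4004, DONE resumes at 0x4004 and RETRY resumes at 0x4000.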














Execution of a DONE or RETRY instruction in the delay slot of a control-transfer instruction
produces undefined results.

Programming Note — Use the DONE and RETRY instructions to return from privileged
trap handlers.





Exceptions 


privileged_opcode
illegal_instruction (if TL = 0 or fcn = 2-31)



A.12 Edge Handling Instructions (VIS I, VIS II)








Opcode     opf           Operation
EDGE8      0 0000 0000   Eight 8-bit edge boundary processing
EDGE8N     0 0000 0001   Eight 8-bit edge boundary processing, no condition codes
EDGE8L     0 0000 0010   Eight 8-bit edge boundary processing, little-endian
EDGE8LN    0 0000 0011   Eight 8-bit edge boundary processing, little-endian, no condition codes
EDGE16     0 0000 0100   Four 16-bit edge boundary processing
EDGE16N    0 0000 0101   Four 16-bit edge boundary processing, no condition codes
EDGE16L    0 0000 0110   Four 16-bit edge boundary processing, little-endian
EDGE16LN   0 0000 0111   Four 16-bit edge boundary processing, little-endian, no condition codes
EDGE32     0 0000 1000   Two 32-bit edge boundary processing
EDGE32N    0 0000 1001   Two 32-bit edge boundary processing, no condition codes
EDGE32L    0 0000 1010   Two 32-bit edge boundary processing, little-endian
EDGE32LN   0 0000 1011   Two 32-bit edge boundary processing, little-endian, no condition codes











Format (3)
10 | rd | 110110 | rs1 | opf | rs2
31 30 29 25 24 19 18 14 13 5 4 0

Assembly Language Syntax

edge8     reg_rs1, reg_rs2, reg_rd
edge8n    reg_rs1, reg_rs2, reg_rd
edge8l    reg_rs1, reg_rs2, reg_rd
edge8ln   reg_rs1, reg_rs2, reg_rd
edge16    reg_rs1, reg_rs2, reg_rd
edge16n   reg_rs1, reg_rs2, reg_rd
edge16l   reg_rs1, reg_rs2, reg_rd
edge16ln  reg_rs1, reg_rs2, reg_rd
edge32    reg_rs1, reg_rs2, reg_rd
edge32n   reg_rs1, reg_rs2, reg_rd
edge32l   reg_rs1, reg_rs2, reg_rd
edge32ln  reg_rs1, reg_rs2, reg_rd

Description 


These instructions handle the boundary conditions for parallel pixel scan line loops, where
src1 is the address of the next pixel to render and src2 is the address of the last pixel in
the scan line.

















EDGE8L(N), EDGE16L(N), and EDGE32L(N) are little-endian versions of EDGE8(N),
EDGE16(N), and EDGE32(N). They produce an edge mask that is bit-reversed from their
big-endian counterparts but are otherwise identical. This makes the mask consistent with the
mask produced by the graphics compare operations (see Section A.44, “Pixel Compare
(VIS I)”) and with the Partial Store instruction (see Section A.41, “Partial Store (VIS I)”) on
little-endian data.





















































A 2-bit (EDGE32), 4-bit (EDGE16), or 8-bit (EDGE8) pixel mask is stored in the least
significant bits of r[rd]. The mask is computed from left and right edge masks as follows:





























1. The left edge mask is computed from the three least significant bits (LSBs) of r[rs1],
and the right edge mask is computed from the three LSBs of r[rs2], according to
TABLE A-4 (TABLE A-5 for little-endian byte ordering).

2. If 32-bit address masking is disabled (PSTATE.AM = 0, 64-bit addressing) and the upper
61 bits of r[rs1] are equal to the corresponding bits in r[rs2], r[rd] is set to the
right edge mask ANDed with the left edge mask.

3. If 32-bit address masking is enabled (PSTATE.AM = 1, 32-bit addressing) and bits 31:3 of
r[rs1] match bits 31:3 of r[rs2], r[rd] is set to the right edge mask ANDed with
the left edge mask.

4. Otherwise, r[rd] is set to the left edge mask.

The integer condition codes are set per the rules of the SUBCC instruction with the same
operands (see Section A.64, “Subtract”).











The EDGE(8, 16, 32)(L)N instructions do not set the integer condition codes. 
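The mask computation in steps 1 through 4 can be expressed as a short sketch. It is an illustrative model only: `edge_mask` and its parameters are hypothetical names, the shift-based formulas are a compact restatement of the rows in TABLE A-4 and TABLE A-5 rather than something the manual states, and condition-code updates are not modelled.

```python
# Hypothetical model of the EDGE mask computation; names are illustrative only.
def edge_mask(size, rs1, rs2, little_endian=False, am=False):
    """Return the pixel mask written to r[rd] for element size 8, 16, or 32."""
    lanes = 64 // size                      # 8, 4, or 2 mask bits
    full = (1 << lanes) - 1
    shift = {8: 0, 16: 1, 32: 2}[size]      # low address bits ignored per element size
    left_idx = (rs1 & 0x7) >> shift
    right_idx = (rs2 & 0x7) >> shift
    if little_endian:
        left = (full << left_idx) & full
        right = full >> (lanes - 1 - right_idx)
    else:
        left = full >> left_idx
        right = (full << (lanes - 1 - right_idx)) & full
    # With PSTATE.AM = 1 only bits 31:3 are compared; otherwise bits 63:3.
    addr_mask = 0xFFFFFFFF if am else (1 << 64) - 1
    same_block = ((rs1 & addr_mask) >> 3) == ((rs2 & addr_mask) >> 3)
    return (left & right) if same_block else left
```

For example, `edge_mask(8, 0x1003, 0x1005)` combines the left mask 0001 1111 with the right mask 1111 1100 because both addresses fall in the same 8-byte block, while `edge_mask(8, 0x1003, 0x2000)` returns only the left mask.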





Exceptions 


None 


TABLE A-4 Edge Mask Specification 


























Edge Size   A2-A0   Left Edge    Right Edge
8           000     1111 1111    1000 0000
8           001     0111 1111    1100 0000
8           010     0011 1111    1110 0000
8           011     0001 1111    1111 0000
8           100     0000 1111    1111 1000
8           101     0000 0111    1111 1100
8           110     0000 0011    1111 1110
8           111     0000 0001    1111 1111
16          00x     1111         1000
16          01x     0111         1100
16          10x     0011         1110
16          11x     0001         1111
32          0xx     11           10
32          1xx     01           11





TABLE A-5 Edge Mask Specification (Little-Endian) 























Edge Size   A2-A0   Left Edge    Right Edge
8           000     1111 1111    0000 0001
8           001     1111 1110    0000 0011
8           010     1111 1100    0000 0111
8           011     1111 1000    0000 1111
8           100     1111 0000    0001 1111
8           101     1110 0000    0011 1111
8           110     1100 0000    0111 1111
8           111     1000 0000    1111 1111
16          00x     1111         0001
16          01x     1110         0011
16          10x     1100         0111
16          11x     1000         1111
32          0xx     11           01
32          1xx     10           11




















A.13 Floating-Point Add and Subtract 


Opcode   op3       opf           Operation
FADDs    11 0100   0 0100 0001   Add Single
FADDd    11 0100   0 0100 0010   Add Double
FADDq    11 0100   0 0100 0011   Add Quad
FSUBs    11 0100   0 0100 0101   Subtract Single
FSUBd    11 0100   0 0100 0110   Subtract Double
FSUBq    11 0100   0 0100 0111   Subtract Quad
Format (3)
10 | rd | op3 | rs1 | opf | rs2
31 30 29 25 24 19 18 14 13 5 4 0

Assembly Language Syntax

fadds  freg_rs1, freg_rs2, freg_rd
faddd  freg_rs1, freg_rs2, freg_rd
faddq  freg_rs1, freg_rs2, freg_rd
fsubs  freg_rs1, freg_rs2, freg_rd
fsubd  freg_rs1, freg_rs2, freg_rd
fsubq  freg_rs1, freg_rs2, freg_rd

Description 


The floating-point add instructions add the floating-point register(s) specified by the rs1 
field and the floating-point register(s) specified by the rs2 field. The instructions then write 
the sum into the floating-point register(s) specified by the rd field. 


The floating-point subtract instructions subtract the floating-point register(s) specified by the 
rs2 field from the floating-point register(s) specified by the rs1 field. The instructions then 
write the difference into the floating-point register(s) specified by the rd field. 


Rounding is performed as specified by the FSR.RD field. 





Compatibility Note — When FSR.NS = 0, the processor operates in standard floating-point
mode. FADD or FSUB with a subnormal result causes an fp_exception_other exception
with FSR.ftt = unfinished_FPop; system software emulates the instruction, and the correct
numerical result is calculated.


When FSR.NS = 1, the processor operates in “nonstandard” floating-point mode. When
FSR.NS = 1 and FADD or FSUB produces a subnormal result on an UltraSPARC IIIi
processor, an fp_exception_other exception occurs with FSR.ftt = unfinished_FPop (even
though the processor is operating in nonstandard floating-point mode); system software then
emulates the instruction, and the correct numerical result is calculated (instead of replacing
the result with zero).


Therefore, the processor may produce a different (albeit more accurate) result than in 
previous processors in the following situation: 


FADD or FSUB produces a subnormal result
FSR.NS = 1

Notes —

1) The processor does not implement (in hardware) the instructions that refer to a quad
floating-point register. Execution of such an instruction generates fp_exception_other (with
ftt = unimplemented_FPop), which causes a trap. Supervisor software then emulates these
instructions.

2) For FADDs, FADDd, FSUBs, FSUBd, an fp_exception_other with ftt = unfinished_FPop
can occur if either operand is NaN.





Exceptions 


fp_disabled
fp_exception_ieee_754 (OF, UF, NX, NV)
fp_exception_other (ftt = unimplemented_FPop (FADDq and FSUBq only))
fp_exception_other (ftt = unfinished_FPop (FADDs, FADDd, FSUBs, FSUBd only))








A.14 Floating-Point Compare 





Opcode   op3       opf           Operation
FCMPs    11 0101   0 0101 0001   Compare Single
FCMPd    11 0101   0 0101 0010   Compare Double
FCMPq    11 0101   0 0101 0011   Compare Quad
FCMPEs   11 0101   0 0101 0101   Compare Single and Exception if Unordered
FCMPEd   11 0101   0 0101 0110   Compare Double and Exception if Unordered
FCMPEq   11 0101   0 0101 0111   Compare Quad and Exception if Unordered
































Format (3)
10 | 000 | cc1 | cc0 | op3 | rs1 | opf | rs2
31 30 29 27 26 25 24 19 18 14 13 5 4 0

Assembly Language Syntax

fcmps   %fccn, freg_rs1, freg_rs2
fcmpd   %fccn, freg_rs1, freg_rs2
fcmpq   %fccn, freg_rs1, freg_rs2
fcmpes  %fccn, freg_rs1, freg_rs2
fcmped  %fccn, freg_rs1, freg_rs2
fcmpeq  %fccn, freg_rs1, freg_rs2

cc1 cc0   Condition Code
00        fcc0
01        fcc1
10        fcc2
11        fcc3

Description 





These instructions compare the floating-point register(s) specified by the rs1 field with the
floating-point register(s) specified by the rs2 field, and set the selected floating-point
condition code (fccn) as shown below.

fcc value   Relation
0           freg_rs1 = freg_rs2
1           freg_rs1 < freg_rs2
2           freg_rs1 > freg_rs2
3           freg_rs1 ? freg_rs2 (unordered)








The “?” in the preceding table means that the comparison is unordered. The unordered
condition occurs when one or both of the operands to the compare is a signalling or quiet
NaN.

The “compare and cause exception if unordered” (FCMPEs, FCMPEd, and FCMPEq)
instructions cause an invalid (NV) exception if either operand is a NaN.

FCMP causes an invalid (NV) exception if either operand is a signalling NaN.

Compatibility Note — Unlike SPARC-V8, SPARC-V9 does not require an instruction
between a floating-point compare operation and a floating-point branch (FBfcc, FBPfcc).

SPARC-V8 floating-point compare instructions are required to have a zero in the r[rd]
field. In SPARC-V9, bits 26 and 25 of the r[rd] field specify the floating-point condition
code to be set. Legal SPARC-V8 code will work on SPARC-V9 because the zeroes in the
r[rd] field are interpreted as fcc0 and the FBfcc instruction branches according to
fcc0.
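The fcc result and the invalid (NV) condition for the single-precision compares can be sketched on raw bit patterns. This is an illustrative model with hypothetical names (`fcmp_single` and its helpers); it assumes the usual IEEE 754 single-precision layout in which a NaN with the most significant fraction bit clear is signalling.

```python
# Hypothetical model of FCMPs/FCMPEs on 32-bit patterns; names are illustrative only.
import struct


def _is_nan(bits):
    return (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0


def _is_snan(bits):
    # Assumption: quiet NaNs have the top fraction bit set, signalling NaNs do not.
    return _is_nan(bits) and not (bits & 0x00400000)


def _to_float(bits):
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]


def fcmp_single(a_bits, b_bits, exception_if_unordered=False):
    """Return (fcc_value, nv_raised); set exception_if_unordered for FCMPEs."""
    if _is_nan(a_bits) or _is_nan(b_bits):
        signalling = _is_snan(a_bits) or _is_snan(b_bits)
        return 3, (exception_if_unordered or signalling)
    a, b = _to_float(a_bits), _to_float(b_bits)
    if a == b:
        return 0, False
    return (1 if a < b else 2), False
```

Comparing 1.0 (0x3F800000) with a quiet NaN (0x7FC00000) yields fcc = 3 in both variants, but only the FCMPEs form reports NV.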





Note — The processor does not implement (in hardware) the instructions that refer to a quad
floating-point register. Execution of such an instruction generates fp_exception_other (with
ftt = unimplemented_FPop), which causes a trap. Supervisor software then emulates these
instructions.


Exceptions 


fp_disabled
fp_exception_ieee_754 (NV)
fp_exception_other (ftt = unimplemented_FPop (FCMPq, FCMPEq only))








A.15 Convert Floating-Point to Integer





Opcode   op3       opf           Operation
FsTOx    11 0100   0 1000 0001   Convert Single to 64-bit Integer
FdTOx    11 0100   0 1000 0010   Convert Double to 64-bit Integer
FqTOx    11 0100   0 1000 0011   Convert Quad to 64-bit Integer
FsTOi    11 0100   0 1101 0001   Convert Single to 32-bit Integer
FdTOi    11 0100   0 1101 0010   Convert Double to 32-bit Integer
FqTOi    11 0100   0 1101 0011   Convert Quad to 32-bit Integer

Format (3)
10 | rd | op3 | — | opf | rs2
31 30 29 25 24 19 18 14 13 5 4 0

Assembly Language Syntax

fstox  freg_rs2, freg_rd
fdtox  freg_rs2, freg_rd
fqtox  freg_rs2, freg_rd
fstoi  freg_rs2, freg_rd
fdtoi  freg_rs2, freg_rd
fqtoi  freg_rs2, freg_rd














Description 


FsTOx, FdTOx, and FqTOx convert the floating-point operand in the floating-point 
register(s) specified by rs2 to a 64-bit integer in the floating-point register(s) specified by 
rd. 


FsTOi, FdTOi, and FqTOi convert the floating-point operand in the floating-point 
register(s) specified by rs2 to a 32-bit integer in the floating-point register specified by rd. 


The result is always rounded toward zero; that is, the rounding direction (RD) field of the 
FSR register is ignored. 


If the floating-point operand’s value is too large to be converted to an integer of the specified
size or is a NaN or infinity, then an fp_exception_ieee_754 “invalid” exception occurs.
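The round-toward-zero rule and the invalid condition can be illustrated with a short sketch. It is a simplified model under stated assumptions: `fp_to_int` is a hypothetical name, Python floats stand in for the source operand, and the value left in the destination register on an invalid conversion is not modelled (the sketch returns None). The operand ranges that trap with unfinished_FPop on this processor are also not modelled.

```python
# Hypothetical model of floating-point-to-integer conversion; names are illustrative only.
import math


def fp_to_int(x, bits):
    """Return (result, invalid) for conversion to a signed integer of `bits` bits."""
    if math.isnan(x) or math.isinf(x):
        return None, True
    result = math.trunc(x)                  # always rounds toward zero; FSR.RD is ignored
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= result <= hi:
        return None, True                   # too large for the destination size
    return result, False
```

For example, `fp_to_int(-2.7, 32)` gives -2 rather than -3, and `fp_to_int(2.0 ** 31, 32)` reports the invalid condition because 2^31 does not fit in a signed 32-bit integer.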


Note — The processor does not implement (in hardware) the instructions that refer to a quad
floating-point register. Execution of such an instruction generates fp_exception_other (with
ftt = unimplemented_FPop), which causes a trap. Supervisor software then emulates these
instructions.

The following floating-point-to-integer conversion instructions generate an unfinished_FPop
exception for certain ranges of floating-point operands, as shown in TABLE A-6.


TABLE A-6 Floating-Point to Integer unfinished_FPop Exception Conditions

Instruction   Unfinished Trap Ranges
FsTOi         result < -2^31, result ≥ 2^31, Inf, NaN
FsTOx         |result| ≥ 2^52, Inf, NaN
FdTOi         result < -2^31, result ≥ 2^31, Inf, NaN
FdTOx         |result| ≥ 2^52, Inf, NaN
Exceptions 


fp_disabled
fp_exception_ieee_754 (NV, NX)
unfinished_FPop
fp_exception_other (ftt = unimplemented_FPop (FqTOi, FqTOx only))



A.16 Convert Between Floating-Point Formats


Opcode   op3       opf           Operation
FsTOd    11 0100   0 1100 1001   Convert Single to Double
FsTOq    11 0100   0 1100 1101   Convert Single to Quad
FdTOs    11 0100   0 1100 0110   Convert Double to Single
FdTOq    11 0100   0 1100 1110   Convert Double to Quad
FqTOs    11 0100   0 1100 0111   Convert Quad to Single
FqTOd    11 0100   0 1100 1011   Convert Quad to Double





























Format (3)
10 | rd | op3 | — | opf | rs2
31 30 29 25 24 19 18 14 13 5 4 0

Assembly Language Syntax

fstod  freg_rs2, freg_rd
fstoq  freg_rs2, freg_rd
fdtos  freg_rs2, freg_rd
fdtoq  freg_rs2, freg_rd
fqtos  freg_rs2, freg_rd
fqtod  freg_rs2, freg_rd

Description 


These instructions convert the floating-point operand in the floating-point register(s) specified 
by rs2 to a floating-point number in the destination format. They write the result into the 
floating-point register(s) specified by rd. 


Rounding is performed as specified by the FSR.RD field.


FqTOd, FqTOs, and FdTOs (the “narrowing” conversion instructions) can raise OF, UF, and
NX exceptions. FdTOq, FsTOq, and FsTOd (the “widening” conversion instructions) cannot.


Any of these six instructions can trigger an NV exception if the source operand is a 
signalling NaN. 





Notes — 

1) The UltraSPARC IIIi processor does not implement (in hardware) the instructions that
refer to a quad floating-point register. Execution of such an instruction generates
fp_exception_other (with ftt = unimplemented_FPop), which causes a trap. Supervisor
software then emulates these instructions.


2) For FdTOs and FsTOd, an fp_exception_other with ftt = unfinished_FPop can occur if
the source operand is NaN or subnormal, or out of range of the destination format.


The following floating-point to floating-point conversion instructions generate an
unfinished_FPop exception for certain ranges of floating-point operands, as shown in
TABLE A-7.


TABLE A-7 Floating-Point/Floating-Point unfinished_FPop Exception Conditions





Instruction Unfinished Trap Ranges 





FdTOs |result| > 252, |result| <23!, operand < — 222, operand > 222, NaN

Exceptions


fp_disabled
fp_exception_ieee_754 (OF, UF, NV, NX)
fp_exception_other (ftt = unimplemented_FPop (FsTOq, FdTOq, FqTOs, FqTOd only))
unfinished_FPop
fp_exception_other (ftt = unfinished_FPop (FdTOs and FsTOd only))








A.17 Convert Integer to Floating-Point





























Opcode opf Operation 
FxTOs 0 1000 0100 Convert 64-bit Integer to Single 
FxTOd 0 1000 1000 Convert 64-bit Integer to Double 
FxTOq 0 1000 1100 Convert 64-bit Integer to Quad 
FiTOs 0 1100 0100 Convert 32-bit Integer to Single 
FiTOd 0 1100 1000 Convert 32-bit Integer to Double 
FiTOq 0 1100 1100 Convert 32-bit Integer to Quad 
Format (3)
10 | rd | op3 | — | opf | rs2
31 30 29 25 24 19 18 14 13 5 4 0


Assembly Language Syntax 








fxtos  freg_rs2, freg_rd
fxtod  freg_rs2, freg_rd
fxtoq  freg_rs2, freg_rd
fitos  freg_rs2, freg_rd
fitod  freg_rs2, freg_rd
fitoq  freg_rs2, freg_rd







Description 





FxTOs, FxTOd, and FxTOq convert the 64-bit signed integer operand in the floating-point 
registers specified by rs2 into a floating-point number in the destination format. 

















FiTOs, FiTOd, and FiTOq convert the 32-bit signed integer operand in floating-point
register(s) specified by rs2 into a floating-point number in the destination format. All write
their result into the floating-point register(s) specified by rd.


FiTOs, FxTOs, and FxTOd round as specified by the FSR.RD field.





Note — The UltraSPARC IIIi processor does not implement (in hardware) the instructions
that refer to a quad floating-point register. Execution of such an instruction generates
fp_exception_other (with ftt = unimplemented_FPop), which causes a trap. Supervisor
software then emulates these instructions.


The following integer-to-floating-point conversion instructions generate an unfinished_FPop
exception for certain ranges of integer operands, as shown in TABLE A-8.


TABLE A-8 Integer/Floating-Point unfinished_FPop Exception Conditions




















Instruction Unfinished Trap Ranges 

FiTOs operand « — Oe: operand = 22 

FxTOs operand « — 222. operand 2 22 

FxTOd operand « — 21, operand > 251 
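
The trap ranges in TABLE A-8 can be expressed as a small range check. The sketch below is illustrative only: the function name and dictionary are hypothetical, and the bounds (-2^22 and 2^22 for FiTOs and FxTOs, -2^51 and 2^51 for FxTOd) are taken as read from the table.

```python
# Bounds as read from TABLE A-8: trap if operand < low or operand >= high.
UNFINISHED_RANGES = {
    "FiTOs": (-(2**22), 2**22),
    "FxTOs": (-(2**22), 2**22),
    "FxTOd": (-(2**51), 2**51),
}


def raises_unfinished_fpop(instr: str, operand: int) -> bool:
    """Return True if the integer operand falls in the unfinished_FPop range."""
    low, high = UNFINISHED_RANGES[instr]
    return operand < low or operand >= high
```

Operands outside these ranges are converted in hardware; operands inside them are left for the trap handler to finish.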
Exceptions 


fp_disabled

fp_exception_ieee_754 (NX (FiTOs, FxTOs, FxTOd only))

unfinished_FPop

fp_exception_other (ftt = unimplemented_FPop (FiTOq, FxTOq only))





A.18 Floating-Point Move






































Opcode  op3      opf          Operation
FMOVs   11 0100  0 0000 0001  Move Single
FMOVd   11 0100  0 0000 0010  Move Double
FMOVq   11 0100  0 0000 0011  Move Quad
FNEGs   11 0100  0 0000 0101  Negate Single
FNEGd   11 0100  0 0000 0110  Negate Double
FNEGq   11 0100  0 0000 0111  Negate Quad
FABSs   11 0100  0 0000 1001  Absolute Value Single
FABSd   11 0100  0 0000 1010  Absolute Value Double
FABSq   11 0100  0 0000 1011  Absolute Value Quad

Format (3)

Bits:    31-30  29-25  24-19  18-14  13-5  4-0
Field:   10     rd     op3    -      opf   rs2





Assembly Language Syntax 























fmovs freg_rs2, freg_rd
fmovd freg_rs2, freg_rd
fmovq freg_rs2, freg_rd
fnegs freg_rs2, freg_rd
fnegd freg_rs2, freg_rd
fnegq freg_rs2, freg_rd
fabss freg_rs2, freg_rd
fabsd freg_rs2, freg_rd
fabsq freg_rs2, freg_rd


Description 


The single-precision versions of these instructions copy the contents of a single-precision 
floating-point register to the destination. The double-precision versions copy the contents of 
a double-precision floating-point register to the destination. The quad-precision versions 
copy a quad-precision value in floating-point registers to the destination. 


FMOV copies the source to the destination unaltered. 





FNEG copies the source to the destination with the sign bit complemented. 








FABS copies the source to the destination with the sign bit cleared. 


These instructions do not round. 
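
The effect of the single-precision forms on the raw register bits can be sketched as follows. This is a model for illustration, not the hardware implementation; the helper names are hypothetical, and the 32-bit IEEE 754 layout with the sign in bit 31 is assumed.

```python
import struct

SIGN32 = 0x8000_0000  # sign bit of a single-precision value
MASK32 = 0xFFFF_FFFF


def f32_bits(x: float) -> int:
    """Return the 32-bit pattern of x rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def fmovs(src: int) -> int:
    """FMOVs: copy the source unaltered."""
    return src & MASK32


def fnegs(src: int) -> int:
    """FNEGs: copy the source with the sign bit complemented."""
    return (src ^ SIGN32) & MASK32


def fabss(src: int) -> int:
    """FABSs: copy the source with the sign bit cleared."""
    return src & ~SIGN32 & MASK32
```

Because only the sign bit is touched, NaN payloads and denormalized values pass through unchanged, which is why no rounding is involved.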


Note — The processor does not implement (in hardware) the instructions that refer to a quad
floating-point register. Execution of such an instruction generates fp_exception_other (with
ftt = unimplemented_FPop), which causes a trap. Supervisor software then emulates these
instructions.


Exceptions


fp_disabled
fp_exception_other (ftt = unimplemented_FPop (FMOVq, FNEGq, FABSq only))








Chapter A Instruction Definitions 309 





A.19 Floating-Point Multiply and Divide 














Opcode  opf          Operation
FMULs   0 0100 1001  Multiply Single
FMULd   0 0100 1010  Multiply Double
FMULq   0 0100 1011  Multiply Quad
FsMULd  0 0110 1001  Multiply Single to Double
FdMULq  0 0110 1110  Multiply Double to Quad
FDIVs   0 0100 1101  Divide Single
FDIVd   0 0100 1110  Divide Double
FDIVq   0 0100 1111  Divide Quad




















Format (3) 
31 30 29 25 24 19 18 14 13 5 4 0 


























Assembly Language Syntax

fmuls freg_rs1, freg_rs2, freg_rd
fmuld freg_rs1, freg_rs2, freg_rd
fmulq freg_rs1, freg_rs2, freg_rd
fsmuld freg_rs1, freg_rs2, freg_rd
fdmulq freg_rs1, freg_rs2, freg_rd
fdivs freg_rs1, freg_rs2, freg_rd
fdivd freg_rs1, freg_rs2, freg_rd
fdivq freg_rs1, freg_rs2, freg_rd

Description 


The floating-point multiply instructions multiply the contents of the floating-point register(s)
specified by the rs1 field by the contents of the floating-point register(s) specified by the
rs2 field. The instructions then write the product into the floating-point register(s) specified
by the rd field.


The FsMULd instruction provides the exact double-precision product of two single-precision
operands, without underflow, overflow, or rounding error. Similarly, FdMULq provides the
exact quad-precision product of two double-precision operands.
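
The exactness claim for FsMULd can be checked numerically: two 24-bit significands give a product of at most 48 bits, which fits in the 53-bit significand of a double. The following sketch models the operation with Python floats (which are IEEE 754 doubles); the function names are hypothetical and only finite operands are considered.

```python
import struct
from fractions import Fraction


def to_single(x: float) -> float:
    """Round a double to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


def fsmuld(a: float, b: float) -> float:
    """Model FsMULd: multiply two single-precision values in double precision."""
    return to_single(a) * to_single(b)


def product_is_exact(a: float, b: float) -> bool:
    """Compare the double-precision product against exact rational arithmetic."""
    a, b = to_single(a), to_single(b)
    return Fraction(a) * Fraction(b) == Fraction(a * b)
```

The check holds even for the largest and smallest single-precision magnitudes, because the double-precision exponent range comfortably covers the sum of two single-precision exponents.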


The floating-point divide instructions divide the contents of the floating-point register(s) 
specified by the rs1 field by the contents of the floating-point register(s) specified by the 
rs2 field. The instructions then write the quotient into the floating-point register(s) specified 
by the rd field. 


Rounding is performed as specified by the FSR.RD field. 





Notes —

1) The processor does not implement (in hardware) the instructions that refer to a quad
floating-point register. Execution of such an instruction generates fp_exception_other (with
ftt = unimplemented_FPop), which causes a trap. Supervisor software then emulates these
instructions.


2) For FDIVs and FDIVd, an fp_exception_other with ftt = unfinished_FPop can occur if
the divide unit detects certain unusual conditions.


Exceptions 


fp_disabled

fp_exception_ieee_754 (OF, UF, DZ (FDIV only), NV, NX)

fp_exception_other (ftt = unimplemented_FPop (FMULq, FdMULq, FDIVq))

fp_exception_other (ftt = unfinished_FPop (FMULs, FMULd, FsMULd, FDIVs, FDIVd))





A.20 Floating-Point Square Root












































Opcode  op3      opf          Operation
FSQRTs  11 0100  0 0010 1001  Square Root Single
FSQRTd  11 0100  0 0010 1010  Square Root Double
FSQRTq  11 0100  0 0010 1011  Square Root Quad

Format (3)
31 30 29 25 24 19 18 14 13 5 4 0

Assembly Language Syntax
fsqrts freg_rs2, freg_rd
fsqrtd freg_rs2, freg_rd
fsqrtq freg_rs2, freg_rd

Description 


These SPARC-V9 instructions generate the square root of the floating-point operand in the
floating-point register(s) specified by the rs2 field and place the result in the destination
floating-point register(s) specified by the rd field. Rounding is performed as specified by the
FSR.RD field.


Note — The processor does not implement (in hardware) the instructions that refer to a quad
floating-point register. Execution of such an instruction generates fp_exception_other (with
ftt = unimplemented_FPop), which causes a trap. Supervisor software then emulates these
instructions.


For FSQRTs and FSQRTd, an fp_exception_other (with ftt = unfinished_FPop) can occur if
the operand to the square root is positive denormalized.


Exceptions


fp_disabled

fp_exception_ieee_754 (NV, NX)
fp_exception_other (unimplemented_FPop) (Quad forms)
fp_exception_other (unfinished_FPop) (FSQRTs, FSQRTd)





A.21 Flush Instruction Memory 

















Opcode op3 Operation 
FLUSH 111011 Flush Instruction Memory 
Format (3)

Bits:    31-30  29-25  24-19  18-14  13     12-5  4-0
i = 0:   10     -      op3    rs1    i = 0  -     rs2
i = 1:   10     -      op3    rs1    i = 1  simm13 (bits 12-0)


Assembly Language Syntax 








flush address








Description 


FLUSH ensures that the doubleword specified as the effective address is consistent across any 
local caches, and in a multiprocessor system, will eventually become consistent everywhere. 


In the following discussion P_FLUSH refers to the processor that executed the FLUSH
instruction.


FLUSH ensures that instruction fetches from the specified effective address by P_FLUSH appear
to execute after any loads, stores, and atomic load-stores to that address issued by P_FLUSH
prior to the FLUSH. In a multiprocessor system, FLUSH also ensures that these values will
eventually become visible to the instruction fetches of all other processors. FLUSH behaves
as if it were a store with respect to MEMBAR-induced orderings. See Section A.34, “Memory
Barrier.”


The effective address operand for the FLUSH instruction is “r[rs1] + r[rs2]” if i = 0,
or “r[rs1] + sign_ext(simm13)” if i = 1. The least significant two address bits of the
effective address are unused and should be supplied as zeroes by software. Bit 2 of the
address is ignored because FLUSH operates on at least a doubleword.
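
The effective-address rule above is shared by FLUSH, JMPL, and the load instructions in this appendix. A minimal sketch of the computation follows, assuming 64-bit wraparound arithmetic; all function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def sign_ext13(simm13: int) -> int:
    """Sign-extend a 13-bit immediate field to a Python int."""
    simm13 &= 0x1FFF
    return simm13 - 0x2000 if simm13 & 0x1000 else simm13


def effective_address(r_rs1: int, i: int, r_rs2: int = 0, simm13: int = 0) -> int:
    """r[rs1] + r[rs2] if i = 0, or r[rs1] + sign_ext(simm13) if i = 1."""
    offset = sign_ext13(simm13) if i == 1 else r_rs2
    return (r_rs1 + offset) & MASK64


def flush_doubleword(ea: int) -> int:
    """Address of the doubleword FLUSH acts on: the low three bits are dropped."""
    return ea & ~0x7 & MASK64
```

For example, a base of 0x1000 with simm13 = 0x1FFF (that is, -1) yields 0xFFF, and the doubleword containing 0x1024 starts at 0x1020.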


Programming Note — 


1. Typically, FLUSH is used in self-modifying code. The use of self-modifying code is 
discouraged. 


2. The order in which memory is modified can be controlled by means of FLUSH and 
MEMBAR instructions interspersed appropriately between stores and atomic load-stores. 
FLUSH is needed only between a store and a subsequent instruction fetch from the 
modified location. When multiple processes may concurrently modify live (that is, 
potentially executing) code, the programmer must ensure that the order of update 
maintains the program in a semantically correct form at all times. 





3. The memory model guarantees in a uniprocessor that data loads observe the results of the 
most recent store, even if there is no intervening FLUSH. 


4. FLUSH may be time consuming. 


5. In a multiprocessor system, the time it takes for a FLUSH to take effect is dependent on 
the system. No mechanism is provided to ensure or test completion. 


6. Because FLUSH is designed to act on a doubleword and on some implementations FLUSH 
may trap to system software, system software should provide a user-callable service 
routine for flushing arbitrarily sized regions of memory. On some processor 
implementations, this routine would issue a series of FLUSH instructions; on others, it 
might issue a single trap to system software that would then flush the entire region. 


On an UltraSPARC IIIi processor:
- A FLUSH instruction flushes the processor pipeline and synchronizes the processor. 


- The instruction cache is kept coherent; therefore, there is no need to perform any action 
on it. 


- The address provided with the FLUSH instruction is ignored. However, for portability 
across all SPARC-V9 implementations, software must supply the target effective address 
in FLUSH instructions. 


FLUSH synchronizes code and data spaces after code space is modified during program 
execution. The FLUSH effective address is ignored. FLUSH does not access the data MMU 
and cannot generate a data MMU miss or exception. 


SPARC-V9 specifies that the FLUSH instruction has no latency on the issuing processor. In 
other words, a store to instruction space prior to the FLUSH instruction is visible immediately 
after the completion of FLUSH. When a FLUSH operation is performed, the processor 
guarantees that earlier code modifications will be visible across the whole system.


Exceptions


None





A.22 Flush Register Windows























Opcode op3 Operation 
FLUSHW 10 1011 Flush Register Windows 
Format (3)

Bits:    31-30  29-25  24-19  18-14  13     12-0
Field:   10     -      op3    -      i = 0  -


Assembly Language Syntax

flushw


Description 


FLUSHW causes all active register windows except the current window to be flushed to 
memory at locations determined by privileged software. FLUSHW behaves as a NOP if there 
are no active windows other than the current window. At the completion of the FLUSHW 
instruction, the only active register window is the current one. 


Programming Note — The FLUSHW instruction can be used by application software to 
switch memory stacks or to examine register contents for previous stack frames. 





FLUSHW acts as a NOP if CANSAVE = NWINDOWS - 2. Otherwise, there is more than one
active window, so FLUSHW causes a spill exception. The trap vector for the spill exception is
based on the contents of OTHERWIN and WSTATE. The spill trap handler is invoked with the
CWP set to the window to be spilled (that is, (CWP + CANSAVE + 2) mod NWINDOWS).


Programming Note — Typically, the spill handler saves a window on a memory stack and
returns to re-execute the FLUSHW instruction. Thus, FLUSHW traps and re-executes until all
active windows other than the current window have been spilled.
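
The trap-and-re-execute loop can be simulated to see which windows are spilled and in what order. This is a sketch under two stated assumptions: NWINDOWS is taken to be 8, and each pass through the spill handler is modeled as incrementing CANSAVE by one. The function name is hypothetical.

```python
NWINDOWS = 8  # assumption: eight register windows


def flushw_spill_order(cwp: int, cansave: int, nwindows: int = NWINDOWS) -> list:
    """Return the window indices spilled by repeated execution of FLUSHW."""
    spilled = []
    # FLUSHW is a NOP once CANSAVE = NWINDOWS - 2.
    while cansave != nwindows - 2:
        # The handler is entered with CWP set to the window to be spilled.
        spilled.append((cwp + cansave + 2) % nwindows)
        # Assumed handler effect: one window saved, one more slot available.
        cansave += 1
    return spilled
```

With CWP = 3 and CANSAVE = 4, two windows (1, then 2) are spilled before FLUSHW completes as a NOP.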


Exceptions 


spill_n_normal
spill_n_other





A.23 Illegal Instruction Trap

















Opcode op op2 Operation 
ILLTRAP 00 000 illegal instruction trap 
Format (2)

Bits:    31-30  29-25  24-22  21-0
Field:   op     -      op2    const22


Assembly Language Syntax

illtrap const22


Description 


The ILLTRAP instruction causes an illegal_instruction exception. The const22 value is 
ignored by the hardware; specifically, this field is not reserved by the architecture for any 
future use. 


Compatibility Note — Except for its name, this instruction is identical to the SPARC-V8 
UNIMP instruction.


Exceptions


illegal_instruction





A.24 Jump and Link 














Opcode op3 Operation 
JMPL 11 1000 Jump and Link 
Format (3)

Bits:    31-30  29-25  24-19  18-14  13     12-5  4-0
i = 0:   10     rd     op3    rs1    i = 0  -     rs2
i = 1:   10     rd     op3    rs1    i = 1  simm13 (bits 12-0)


Assembly Language Syntax

jmpl address, reg_rd


Description 


The JMPL instruction causes a register-indirect delayed control transfer to the address given
by “r[rs1] + r[rs2]” if i = 0, or “r[rs1] + sign_ext(simm13)” if i = 1.


The JMPL instruction copies the PC, which contains the address of the JMPL instruction, into
register r[rd].


If either of the low-order two bits of the jump address is nonzero, a
mem_address_not_aligned exception occurs.


Programming Note — A JMPL instruction with rd = 15 functions as a register-indirect
call using the standard link register.


JMPL with rd = 0 can be used to return from a subroutine. The typical return address is
“r[31] + 8,” if a nonleaf routine (one that uses the SAVE instruction) is entered by a CALL
instruction, or “r[15] + 8” if a leaf routine (one that does not use the SAVE instruction) is
entered by a CALL instruction or by a JMPL instruction with rd = 15.
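
The call and return idioms described above can be modeled with a small register-file simulation. This is an illustrative sketch: the function and exception class are hypothetical, the register file is a plain list of 32 integers, r[0] is treated as hardwired to zero, and the delay slot is not modeled (the function only returns the target address).

```python
MASK64 = (1 << 64) - 1


class MemAddressNotAligned(Exception):
    """Stands in for the mem_address_not_aligned exception."""


def sign_ext13(simm13: int) -> int:
    simm13 &= 0x1FFF
    return simm13 - 0x2000 if simm13 & 0x1000 else simm13


def jmpl(pc, regs, rd, rs1, i, rs2=0, simm13=0):
    """Model JMPL: return the jump target and write the PC into r[rd]."""
    offset = sign_ext13(simm13) if i == 1 else regs[rs2]
    target = (regs[rs1] + offset) & MASK64
    if target & 0x3:  # either low-order bit nonzero
        raise MemAddressNotAligned(hex(target))
    if rd != 0:  # writes to r[0] are discarded
        regs[rd] = pc
    return target
```

A leaf-routine return is then `jmpl(pc, regs, rd=0, rs1=15, i=1, simm13=8)`, which jumps to r[15] + 8 and discards the link value.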








Exceptions 


mem_address_not_aligned





A.25 Load Floating-Point 


Opcode   op3      rd     Operation
LDF      10 0000  0-31   Load Floating-Point Register
LDDF     10 0011  †      Load Double Floating-Point Register
LDQF     10 0010  †      Load Quad Floating-Point Register
LDXFSR   10 0001  1      Load Floating-Point State Register
-        10 0001  2-31   Reserved

† Encoded floating-point register value.


Format (3)

Bits:    31-30  29-25  24-19  18-14  13     12-5  4-0
i = 0:   11     rd     op3    rs1    i = 0  -     rs2
i = 1:   11     rd     op3    rs1    i = 1  simm13 (bits 12-0)


Assembly Language Syntax

ld [address], freg_rd
ldd [address], freg_rd
ldq [address], freg_rd
ldx [address], %fsr
Description 


The load single floating-point instruction (LDF) copies a word from memory into f[rd].


The load doubleword floating-point instruction (LDDF) copies a word-aligned doubleword
from memory into a double-precision floating-point register.


The load quad floating-point instruction (LDQF) traps to software.


The load floating-point state register instruction (LDXFSR) waits for all FPop instructions
that have not finished execution to complete and then loads a doubleword from memory into
the FSR.


Load floating-point instructions access the primary address space (ASI = 80₁₆). The effective
address for these instructions is “r[rs1] + r[rs2]” if i = 0, or
“r[rs1] + sign_ext(simm13)” if i = 1.


LDF causes a mem_address_not_aligned exception if the effective memory address is not
word aligned. LDXFSR causes a mem_address_not_aligned exception if the address is not
doubleword aligned. If the floating-point unit is not enabled (per FPRS.FEF and
PSTATE.PEF) or if no FPU is present, then a load floating-point instruction causes an
fp_disabled exception.

















LDDF requires doubleword alignment. If the address is only word aligned, LDDF causes an
LDDF_mem_address_not_aligned exception. The trap handler software shall emulate the
LDDF instruction and return.











Programming Note — In SPARC-V8, some compilers issued sequences of single- 
precision loads when they could not determine that doubleword or quadword operands were 
properly aligned. For SPARC-V9, since emulation of misaligned loads is expected to be fast, 
compilers are recommended to issue sets of single-precision loads only when they can 
determine that doubleword or quadword operands are not properly aligned. 


If a load floating-point instruction traps with any type of access error, the contents of the
destination floating-point register(s) are undefined.


In the UltraSPARC IIIi processor, an LDDF instruction causes an
LDDF_mem_address_not_aligned trap if the effective address is 32-bit aligned but not 64-bit
(doubleword) aligned.
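
The alignment rules for the floating-point loads can be summarized as a classifier. This is a sketch with a hypothetical function name. The LDF, LDXFSR, and word-aligned LDDF cases follow the text above; mapping an LDDF address that is not even word aligned to the generic mem_address_not_aligned exception is an assumption.

```python
def fp_load_alignment_trap(instr: str, ea: int):
    """Return the alignment exception name for a load, or None if aligned."""
    if instr == "LDF":
        return "mem_address_not_aligned" if ea % 4 else None
    if instr == "LDXFSR":
        return "mem_address_not_aligned" if ea % 8 else None
    if instr == "LDDF":
        if ea % 8 == 0:
            return None
        if ea % 4 == 0:
            # Word aligned but not doubleword aligned: emulated by software.
            return "LDDF_mem_address_not_aligned"
        return "mem_address_not_aligned"  # assumption: generic trap otherwise
    raise ValueError(f"unsupported instruction: {instr}")
```

The separate LDDF trap lets the handler emulate the common word-aligned case quickly, which is the basis for the compiler guidance in the programming note.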





Exceptions 


illegal_instruction (op3 = 21₁₆ and rd = 2-31)
fp_disabled
LDDF_mem_address_not_aligned (LDDF only)
mem_address_not_aligned
data_access_exception
PA_watchpoint
VA_watchpoint
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection





A.26 Load Floating-Point from Alternate Space











Opcode       op3      rd     Operation
LDFA^PASI    11 0000  0-31   Load Floating-Point Register from Alternate Space
LDDFA^PASI   11 0011  †      Load Double Floating-Point Register from Alternate Space
LDQFA^PASI   11 0010  †      Load Quad Floating-Point Register from Alternate Space

† Encoded floating-point register value.


Format (3)

Bits:    31-30  29-25  24-19  18-14  13     12-5     4-0
i = 0:   11     rd     op3    rs1    i = 0  imm_asi  rs2
i = 1:   11     rd     op3    rs1    i = 1  simm13 (bits 12-0)


Assembly Language Syntax

lda [regaddr] imm_asi, freg_rd
lda [reg_plus_imm] %asi, freg_rd
ldda [regaddr] imm_asi, freg_rd
ldda [reg_plus_imm] %asi, freg_rd
ldqa [regaddr] imm_asi, freg_rd
ldqa [reg_plus_imm] %asi, freg_rd
Description 


The load single floating-point from alternate space instruction (LDFA) copies a word from
memory into f[rd].


The load double floating-point from alternate space instruction (LDDFA) copies a word-
aligned doubleword from memory into a double-precision floating-point register.


The load quad floating-point from alternate space instruction (LDQFA) traps to software.


Load floating-point from alternate space instructions contain the address space
identifier (ASI) to be used for the load in the imm_asi field if i = 0, or in the ASI register
if i = 1. The access is privileged if bit 7 of the ASI is zero; otherwise, it is not privileged.
The effective address for these instructions is “r[rs1] + r[rs2]” if i = 0, or
“r[rs1] + sign_ext(simm13)” if i = 1.


LDFA causes a mem_address_not_aligned exception if the effective memory address is not
word aligned. If the floating-point unit is not enabled (per FPRS.FEF and PSTATE.PEF) or
if no FPU is present, then load floating-point from alternate space instructions cause an
fp_disabled exception.

















LDDFA with certain target ASIs is defined to be a 64-byte block-load instruction. See
Section A.4, “Block Load and Block Store (VIS I)” for details.


Implementation Note — LDFA and LDDFA cause a privileged_action exception if
PSTATE.PRIV = 0 and bit 7 of the ASI is zero.
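
The ASI selection and privilege rule used by the alternate-space instructions can be sketched as two small helpers. The function names are hypothetical; the rule itself (bit 7 of the ASI clear means a privileged access) is taken from the text above.

```python
def select_asi(i: int, imm_asi: int, asi_register: int) -> int:
    """ASI comes from the imm_asi field if i = 0, or the ASI register if i = 1."""
    return (imm_asi if i == 0 else asi_register) & 0xFF


def asi_privilege_check(asi: int, pstate_priv: int):
    """Return 'privileged_action' if a nonprivileged access uses a privileged ASI."""
    privileged_asi = (asi & 0x80) == 0  # bit 7 of the ASI is zero
    if privileged_asi and pstate_priv == 0:
        return "privileged_action"
    return None
```

ASI 80₁₆ (the primary address space) has bit 7 set, so it is usable from nonprivileged code, whereas ASIs such as 24₁₆ are not.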








LDDF requires doubleword alignment. If the address is only word aligned, LDDF causes an
LDDF_mem_address_not_aligned exception. The trap handler software shall emulate the
LDDF instruction and return.


Programming Note — In SPARC-V8, some compilers issued sequences of single-
precision loads when they could not determine that doubleword or quadword operands were
properly aligned. For SPARC-V9, since emulation of misaligned loads is expected to be fast,
compilers should issue sets of single-precision loads only when they can determine that
doubleword or quadword operands are not properly aligned.


If a load floating-point instruction traps with any type of access error, the contents of the
destination floating-point register(s) are undefined.


In the UltraSPARC IIIi processor, an LDDFA instruction causes an
LDDF_mem_address_not_aligned trap if the effective address is 32-bit aligned but not 64-bit
(doubleword) aligned.





Exceptions 


illegal_instruction (LDQFA only)
fp_disabled
LDDF_mem_address_not_aligned (LDDFA only)
mem_address_not_aligned
privileged_action
data_access_exception
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection
VA_watchpoint
PA_watchpoint





A.27 Load Integer




















Opcode  Operation
LDSB    Load Signed Byte
LDSH    Load Signed Halfword
LDSW    Load Signed Word
LDUB    Load Unsigned Byte
LDUH    Load Unsigned Halfword
LDUW    Load Unsigned Word
LDX     Load Extended Word


Format (3)

Bits:    31-30  29-25  24-19  18-14  13     12-5  4-0
i = 0:   11     rd     op3    rs1    i = 0  -     rs2
i = 1:   11     rd     op3    rs1    i = 1  simm13 (bits 12-0)
































Assembly Language Syntax 

ldsb [address], reg_rd
ldsh [address], reg_rd
ldsw [address], reg_rd
ldub [address], reg_rd
lduh [address], reg_rd
lduw [address], reg_rd (synonym: ld)
ldx [address], reg_rd

Description 


The load integer instructions copy a byte, a halfword, a word, or an extended word from
memory. All copy the fetched value into r[rd]. A fetched byte, halfword, or word is right-
justified in the destination register r[rd]; it is either sign-extended or zero-filled on the left,
depending on whether the opcode specifies a signed or unsigned operation, respectively.


Load integer instructions access the primary address space (ASI = 80₁₆). The effective
address is “r[rs1] + r[rs2]” if i = 0, or “r[rs1] + sign_ext(simm13)” if i = 1.


A successful load (notably, load extended) instruction operates atomically.


LDUH and LDSH cause a mem_address_not_aligned exception if the address is not halfword
aligned. LDUW and LDSW cause a mem_address_not_aligned exception if the effective
address is not word aligned. LDX causes a mem_address_not_aligned exception if the address
is not doubleword aligned.
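
Sign extension versus zero fill, together with the alignment rule, can be illustrated with a byte-array model of memory. This is a sketch only: the function name is hypothetical, memory is a Python bytes object, and big-endian byte order is assumed for the access.

```python
MASK64 = (1 << 64) - 1


def load_integer(mem: bytes, addr: int, size: int, signed: bool) -> int:
    """Model LDSB/LDUB (size 1), LDSH/LDUH (2), LDSW/LDUW (4), and LDX (8).

    Returns the 64-bit value written to r[rd].
    """
    if addr % size:
        raise ValueError("mem_address_not_aligned")
    value = int.from_bytes(mem[addr:addr + size], "big", signed=signed)
    # A negative value becomes its 64-bit two's-complement pattern,
    # which is the sign-extended register contents.
    return value & MASK64
```

Loading the byte FF₁₆ with the signed form fills the register with ones, while the unsigned form leaves the upper 56 bits zero.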


Compatibility Note — The SPARC-V8 LD instruction has been renamed LDUW in 
SPARC-V9. The LDSW instruction is new in SPARC-V9.


Exceptions


mem_address_not_aligned (all except LDSB, LDUB) 
data_access_exception 

data_access_error 

fast_data_access_MMU_miss 
fast_data_access_protection 

VA_watchpoint 

PA_watchpoint 








A.28 Load Integer from Alternate Space 





























Opcode       Operation
LDSBA^PASI   Load Signed Byte from Alternate Space
LDSHA^PASI   Load Signed Halfword from Alternate Space
LDSWA^PASI   Load Signed Word from Alternate Space
LDUBA^PASI   Load Unsigned Byte from Alternate Space
LDUHA^PASI   Load Unsigned Halfword from Alternate Space
LDUWA^PASI   Load Unsigned Word from Alternate Space
LDXA^PASI    Load Extended Word from Alternate Space


Format (3)

Bits:    31-30  29-25  24-19  18-14  13     12-5     4-0
i = 0:   11     rd     op3    rs1    i = 0  imm_asi  rs2
i = 1:   11     rd     op3    rs1    i = 1  simm13 (bits 12-0)


Assembly Language Syntax

ldsba [regaddr] imm_asi, reg_rd
ldsha [regaddr] imm_asi, reg_rd
ldswa [regaddr] imm_asi, reg_rd
lduba [regaddr] imm_asi, reg_rd
lduha [regaddr] imm_asi, reg_rd
lduwa [regaddr] imm_asi, reg_rd (synonym: lda)
ldxa [regaddr] imm_asi, reg_rd
ldsba [reg_plus_imm] %asi, reg_rd
ldsha [reg_plus_imm] %asi, reg_rd
ldswa [reg_plus_imm] %asi, reg_rd
lduba [reg_plus_imm] %asi, reg_rd
lduha [reg_plus_imm] %asi, reg_rd
lduwa [reg_plus_imm] %asi, reg_rd (synonym: lda)
ldxa [reg_plus_imm] %asi, reg_rd

Description 


The load integer from alternate space instructions copy a byte, halfword, word, or an
extended word from memory. All copy the fetched value into r[rd]. A fetched byte,
halfword, or word is right-justified in the destination register r[rd]; it is either sign-
extended or zero-filled on the left, depending on whether the opcode specifies a signed or
unsigned operation, respectively.


The load integer from alternate space instructions contain the address space identifier (ASI)
to be used for the load in the imm_asi field if i = 0, or in the ASI register if i = 1. The
access is privileged if bit 7 of the ASI is zero; otherwise, it is not privileged. The effective
address for these instructions is “r[rs1] + r[rs2]” if i = 0, or
“r[rs1] + sign_ext(simm13)” if i = 1.


A successful load (notably, load extended) instruction operates atomically. 


LDUHA and LDSHA cause a mem_address_not_aligned exception if the address is not 
halfword aligned. LDUWA and LDSWA cause a mem_address_not_aligned exception if the 
effective address is not word aligned; LDXA causes a mem_address_not_aligned exception if 
the address is not doubleword aligned. 


These instructions cause a privileged_action exception if PSTATE.PRIV = 0 and bit 7 of the
ASI is zero.


Exceptions


privileged_action
mem_address_not_aligned (all except LDSBA and LDUBA)
data_access_exception
PA_watchpoint
VA_watchpoint
fast_data_access_MMU_miss
fast_data_access_protection
data_access_error








A.29 Load Quadword, Atomic (VIS I) 











Opcode  imm_asi                  ASI Value  Operation
LDDA    ASI_NUCLEUS_QUAD_LDD     24₁₆       128-bit atomic load
LDDA    ASI_NUCLEUS_QUAD_LDD_L   2C₁₆       128-bit atomic load, little-endian
LDDA    ASI_QUAD_LDD_PHYS        34₁₆       128-bit atomic load
LDDA    ASI_QUAD_LDD_PHYS_L      3C₁₆       128-bit atomic load, little-endian


Format (3) LDDA

31 30 29 25 24 19 18 14 13 5 4 0


Assembly Language Syntax
ldda [reg_addr] imm_asi, reg_rd
ldda [reg_plus_imm] %asi, reg_rd


Description


ASIs 24₁₆ and 2C₁₆ are used with the LDDA instruction to atomically read a 128-bit, virtually
addressed data item. They are intended to be used by a TLB miss handler to access TSB
entries without requiring locks. The data is placed in an even/odd pair of 64-bit registers. The
lowest-address 64 bits are placed in the even register; the highest-address 64 bits are placed
in the odd-numbered register. The reference is made from the nucleus context. ASIs 24₁₆ and
2C₁₆ are translated by the MMU into physical addresses according to normal translation rules
for the nucleus context.


To reduce the number of locked pages in the D-TLB, a new ASI load instruction, atomic quad
load physical (ldda ASI_QUAD_LDD_PHYS), was added. It allows a full TTE entry
(128 bits, tag and data) in the TSB to be read directly with a PA, bypassing the VA-to-PA
translation. In the D-TLB miss handler, a TTE entry is read using two ldx instructions. ASIs
34₁₆ and 3C₁₆ are not translated by the MMU, and the addresses provided are interpreted directly
as physical addresses.


Since quad load with these ASIs bypasses the D-MMU, the physical address is set equal to
the truncated virtual address, that is, PA[42:0] = VA[42:0]. Internally in hardware, the
physical page attribute bits of these ASIs are hardcoded (not coming from DCU Control
Register) as follows:








CP=1, CV=0, IE=0, E=0, P=0, W=0, NFO=0, Size=8K 














Note that (CP, CV) = 10 means it is cacheable in L2-cache, W-cache, and P-cache, but not D- 
cache (since D-cache is VA-indexed). Therefore, this atomic quad load physical instruction 
can only be used with cacheable PA. 


Semantically, ASI_QUAD_LDD_PHYS is like a combination of
ASI_NUCLEUS_QUAD_LDD and ASI_PHYS_USE_EC.


An illegal_instruction exception occurs if an odd “rd” register number is used. If non-privileged
software tries to use this ASI, a privileged_action exception occurs. If the physical address of
the data referenced matches the watchpoint register
(ASI_DMMU_PA_WATCHPOINT_REG), the PA_watchpoint exception occurs.


In addition to the usual traps for LDDA using a privileged ASI, a data_access_exception trap
occurs for a non-cacheable access or if a quadword-load ASI is used with any instruction
other than LDDA. A mem_address_not_aligned trap is taken if the access is not aligned on a
16-byte (128-bit) boundary.
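
How the 128-bit item is split across the register pair can be sketched with a byte-array model. This is illustrative only: the function name is hypothetical, only the big-endian ASIs are modeled, and alignment of the 128-bit access on a 16-byte boundary is an assumption of the sketch.

```python
def ldda_quad(mem: bytes, addr: int, rd: int) -> dict:
    """Model a big-endian 128-bit atomic load into an even/odd register pair."""
    if rd % 2:
        raise ValueError("illegal_instruction")  # odd rd is not allowed
    if addr % 16:
        raise ValueError("mem_address_not_aligned")  # assumed 16-byte alignment
    low_half = int.from_bytes(mem[addr:addr + 8], "big")
    high_half = int.from_bytes(mem[addr + 8:addr + 16], "big")
    # Lowest-address 64 bits go to the even register, the rest to the odd one.
    return {rd: low_half, rd + 1: high_half}
```

For a TSB entry, this places the TTE tag and TTE data in adjacent registers from a single atomic access, which is what removes the need for a lock in the miss handler.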


Exceptions 


privileged_action
PA_watchpoint (recognized on only the first 8 bytes of an access)
VA_watchpoint (recognized on only the first 8 bytes of an access)
illegal_instruction (misaligned rd)
mem_address_not_aligned
data_access_exception (an attempt to access a page marked as non-cacheable)
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection





A.30 Load-Store Unsigned Byte














Opcode op3 Operation 
LDSTUB 00 1101 Load-Store Unsigned Byte 
Format (3)

Bits:    31-30  29-25  24-19  18-14  13     12-5  4-0
i = 0:   11     rd     op3    rs1    i = 0  -     rs2
i = 1:   11     rd     op3    rs1    i = 1  simm13 (bits 12-0)


Assembly Language Syntax
ldstub [address], reg_rd














Description 


The load-store unsigned byte instruction copies a byte from memory into r[rd], then
rewrites the addressed byte in memory to all ones. The fetched byte is right-justified in the
destination register r[rd] and zero-filled on the left.


The operation is performed atomically, that is, without allowing intervening interrupts or 
deferred traps. In a multiprocessor system, two or more processors executing LDSTUB, 
LDSTUBA, CASA, CASXA, SWAP, or SWAPA instructions addressing all or parts of the same 
doubleword simultaneously are guaranteed to execute them in an undefined, but serial order. 
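
A common use of an atomic read-then-set-to-ones operation is a test-and-set lock. The sketch below models that usage; the class and function names are hypothetical, and a `threading.Lock` stands in for the atomicity that the hardware provides.

```python
import threading


class ByteMemory:
    """Toy memory whose ldstub() is atomic with respect to other accesses."""

    def __init__(self, size: int):
        self._data = bytearray(size)
        self._atomic = threading.Lock()  # stand-in for hardware atomicity

    def ldstub(self, addr: int) -> int:
        """Return the old byte (zero-filled) and rewrite it to all ones."""
        with self._atomic:
            old = self._data[addr]
            self._data[addr] = 0xFF
            return old

    def store_byte(self, addr: int, value: int) -> None:
        with self._atomic:
            self._data[addr] = value & 0xFF


def try_acquire(mem: ByteMemory, lock_addr: int) -> bool:
    """The lock is taken if the byte was zero before LDSTUB set it."""
    return mem.ldstub(lock_addr) == 0


def release(mem: ByteMemory, lock_addr: int) -> None:
    mem.store_byte(lock_addr, 0)
```

Because concurrent LDSTUB operations on the same doubleword execute in some serial order, exactly one contender observes the zero byte and wins the lock.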


The effective address for these instructions is “r[rs1] + r[rs2]” if i = 0, or
“r[rs1] + sign_ext(simm13)” if i = 1.


The coherence and atomicity of memory operations between processors and I/O DMA
memory accesses are maintained for cacheable memory space.


Exceptions


data_access_exception
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection
VA_watchpoint
PA_watchpoint








A.31 Load-Store Unsigned Byte to Alternate Space


Opcode         op3      Operation
LDSTUBA^PASI   01 1101  Load-Store Unsigned Byte into Alternate Space


Format (3)

Bits:    31-30  29-25  24-19  18-14  13     12-5     4-0
i = 0:   11     rd     op3    rs1    i = 0  imm_asi  rs2
i = 1:   11     rd     op3    rs1    i = 1  simm13 (bits 12-0)


Assembly Language Syntax
ldstuba [regaddr] imm_asi, reg_rd
ldstuba [reg_plus_imm] %asi, reg_rd





Description 


The load-store unsigned byte into alternate space instruction copies a byte from memory into
r[rd], then rewrites the addressed byte in memory to all ones. The fetched byte is right-
justified in the destination register r[rd] and zero-filled on the left.


The operation is performed atomically, that is, without allowing intervening interrupts or
deferred traps. In a multiprocessor system, two or more processors executing LDSTUB,
LDSTUBA, CASA, CASXA, SWAP, or SWAPA instructions addressing all or parts of the same
doubleword simultaneously are guaranteed to execute them in an undefined, but serial order.


LDSTUBA contains the address space identifier (ASI) to be used for the load in the imm_asi
field if i = 0, or in the ASI register if i = 1. The access is privileged if bit 7 of the ASI is
zero; otherwise, it is not privileged. The effective address is “r[rs1] + r[rs2]” if i = 0,
or “r[rs1] + sign_ext(simm13)” if i = 1.


LDSTUBA causes a privileged_action exception if PSTATE.PRIV = 0 and bit 7 of the ASI is
zero.


The coherence and atomicity of memory operations between processors and I/O DMA 
memory accesses is maintained for cacheable memory space. 


Exceptions 


privileged_action
data_access_exception
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection
VA_watchpoint
PA_watchpoint





A.32 Logical Operate Instructions (VIS I)

































































Opcode     opf          Operation
FZERO      0 0110 0000  Zero fill
FZEROS     0 0110 0001  Zero fill, single precision
FONE       0 0111 1110  One fill
FONES      0 0111 1111  One fill, single precision
FSRC1      0 0111 0100  Copy src1
FSRC1S     0 0111 0101  Copy src1, single precision
FSRC2      0 0111 1000  Copy src2
FSRC2S     0 0111 1001  Copy src2, single precision
FNOT1      0 0110 1010  Negate (ones-complement) src1
FNOT1S     0 0110 1011  Negate (ones-complement) src1, single precision
FNOT2      0 0110 0110  Negate (ones-complement) src2
FNOT2S     0 0110 0111  Negate (ones-complement) src2, single precision
FOR        0 0111 1100  Logical OR
FORS       0 0111 1101  Logical OR, single precision
FNOR       0 0110 0010  Logical NOR
FNORS      0 0110 0011  Logical NOR, single precision
FAND       0 0111 0000  Logical AND
FANDS      0 0111 0001  Logical AND, single precision
FNAND      0 0110 1110  Logical NAND
FNANDS     0 0110 1111  Logical NAND, single precision
FXOR       0 0110 1100  Logical XOR
FXORS      0 0110 1101  Logical XOR, single precision
FXNOR      0 0111 0010  Logical XNOR
FXNORS     0 0111 0011  Logical XNOR, single precision
FORNOT1    0 0111 1010  Negated src1 OR src2
FORNOT1S   0 0111 1011  Negated src1 OR src2, single precision
FORNOT2    0 0111 0110  src1 OR negated src2
FORNOT2S   0 0111 0111  src1 OR negated src2, single precision
FANDNOT1   0 0110 1000  Negated src1 AND src2
FANDNOT1S  0 0110 1001  Negated src1 AND src2, single precision
FANDNOT2   0 0110 0100  src1 AND negated src2
FANDNOT2S  0 0110 0101  src1 AND negated src2, single precision


Format (3)

Bits:    31-30  29-25  24-19  18-14  13-5  4-0
Field:   10     rd     op3    rs1    opf   rs2

Assembly Language Syntax 
fzero freg,g 
fzeros freg,g 
fone freg,a 
fones freg,g 
fsrcl freg,s y, fegra 
fsrcls freg, y, frega 
fsrc2 fregys», fregya 
fsrc2s freg,s», Nera 
fnotl freg,s y, fegra 
fnotls freg,s p, fegra 
fnot2 fregys2, fregya 
fnot2s freg,s>, frega 
for freg,s b fregys», frega 
fors freg,s TES, frega 
fnor freg,s b fregys», frega 
fnors freg,s i, freg,s», frega 
fand freg,s TES, frega 
fand freg,s b freg,s», frega 
fnands freg,s p MSps2 frega 
fnands freg,s i, fregys», frega 
xor freg,s p freg,s», frega 
fxors freg,s b fregys», frega 
fxnor freg,s pp fregys», frega 
fxnors freg,s j, freg,s», frega 
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Instruction Definitions 


opf 


rs2 


A-333 





A-334 
































Assembly Language Syntax 
fornotl freg,s j, Jregrs» frega 
fornotls freg,s p fregys», frega 
fornot2 fr e8, i, fregys», frega 
fornot2s freg,s p freg,s», frega 
fandnotl freg,s p fregys», frega 
fandnotls freg,s 1, fFegys», NeSra 
fandnot2 freg,s j, MSps2 frega 
fandnot2s freg,s j, fregys», frega 

Description 


The standard 64-bit versions of these instructions perform one of sixteen 64-bit logical
operations between the 64-bit floating-point registers specified by rs1 and rs2. The result is
stored in the 64-bit floating-point destination register specified by rd. The 32-bit
(single-precision) versions of these instructions perform 32-bit logical operations.


Note — For good performance, the result of a single-precision logical instruction should not
be used as part of a 64-bit graphics instruction source operand in the next three instruction
groups. Similarly, the result of a standard (64-bit) logical instruction should not be used as
a 32-bit graphics instruction source operand in the next three instruction groups.
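
The sixteen operations in the opcode table can be modelled in plain C as a rough reference. This is only a sketch: the enum and function names are hypothetical, the operands are passed as integers standing in for floating-point register contents, and nothing here reflects pipeline timing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical selector names for this sketch; they mirror the mnemonics in the opcode table. */
typedef enum {
    VIS_FZERO, VIS_FONE, VIS_FSRC1, VIS_FSRC2, VIS_FNOT1, VIS_FNOT2,
    VIS_FOR, VIS_FNOR, VIS_FAND, VIS_FNAND, VIS_FXOR, VIS_FXNOR,
    VIS_FORNOT1, VIS_FORNOT2, VIS_FANDNOT1, VIS_FANDNOT2
} vis_logical_op;

/* 64-bit forms: src1 and src2 stand for the register contents named by rs1 and rs2. */
uint64_t vis_logical64(vis_logical_op op, uint64_t src1, uint64_t src2)
{
    switch (op) {
    case VIS_FZERO:    return 0;
    case VIS_FONE:     return ~UINT64_C(0);
    case VIS_FSRC1:    return src1;
    case VIS_FSRC2:    return src2;
    case VIS_FNOT1:    return ~src1;
    case VIS_FNOT2:    return ~src2;
    case VIS_FOR:      return src1 | src2;
    case VIS_FNOR:     return ~(src1 | src2);
    case VIS_FAND:     return src1 & src2;
    case VIS_FNAND:    return ~(src1 & src2);
    case VIS_FXOR:     return src1 ^ src2;
    case VIS_FXNOR:    return ~(src1 ^ src2);
    case VIS_FORNOT1:  return ~src1 | src2;   /* negated src1 OR src2 */
    case VIS_FORNOT2:  return src1 | ~src2;   /* src1 OR negated src2 */
    case VIS_FANDNOT1: return ~src1 & src2;   /* negated src1 AND src2 */
    case VIS_FANDNOT2: return src1 & ~src2;   /* src1 AND negated src2 */
    }
    assert(!"unknown selector");
    return 0;
}

/* Single-precision forms: the same bitwise operation on 32-bit operands. */
uint32_t vis_logical32(vis_logical_op op, uint32_t src1, uint32_t src2)
{
    return (uint32_t)vis_logical64(op, src1, src2);
}
```

Because every operation is bitwise, truncating the 64-bit result gives the 32-bit result, which is why the single-precision helper can simply reuse the 64-bit one.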


Exceptions 


fp_disabled


UltraSPARC Illi Processor User's Manual * June 2003 





A.33 Logical Operations























Opcode   op3       Operation
AND      00 0001   AND
ANDcc    01 0001   AND and modify condition codes
ANDN     00 0101   AND Not
ANDNcc   01 0101   AND Not and modify condition codes
OR       00 0010   Inclusive OR
ORcc     01 0010   Inclusive OR and modify condition codes
ORN      00 0110   Inclusive OR Not
ORNcc    01 0110   Inclusive OR Not and modify condition codes
XOR      00 0011   Exclusive OR
XORcc    01 0011   Exclusive OR and modify condition codes
XNOR     00 0111   Exclusive NOR
XNORcc   01 0111   Exclusive NOR and modify condition codes

Format (3)

10    rd      op3     rs1     i=0   —        rs2
10    rd      op3     rs1     i=1   simm13
31 30 29   25 24   19 18   14 13    12     5 4     0


























Assembly Language Syntax

and      reg_rs1, reg_or_imm, reg_rd
andcc    reg_rs1, reg_or_imm, reg_rd
andn     reg_rs1, reg_or_imm, reg_rd
andncc   reg_rs1, reg_or_imm, reg_rd
or       reg_rs1, reg_or_imm, reg_rd
orcc     reg_rs1, reg_or_imm, reg_rd
orn      reg_rs1, reg_or_imm, reg_rd
orncc    reg_rs1, reg_or_imm, reg_rd
xor      reg_rs1, reg_or_imm, reg_rd
xorcc    reg_rs1, reg_or_imm, reg_rd
xnor     reg_rs1, reg_or_imm, reg_rd
xnorcc   reg_rs1, reg_or_imm, reg_rd

Description 


These instructions implement bitwise logical operations. They compute
"r[rs1] op r[rs2]" if i = 0, or "r[rs1] op sign_ext(simm13)" if i = 1, and
write the result into r[rd].


ANDcc, ANDNcc, ORcc, ORNcc, XORcc, and XNORcc modify the integer condition codes
(icc and xcc). They set the condition codes as follows:

- icc.v, icc.c, xcc.v, and xcc.c to zero

- icc.n to bit 31 of the result

- xcc.n to bit 63 of the result

- icc.z to one if bits 31:0 of the result are zero (otherwise to zero)

- xcc.z to one if all 64 bits of the result are zero (otherwise to zero)


ANDN, ANDNcc, ORN, and ORNcc logically negate their second operand before applying the
main (AND or OR) operation.


Programming Note — XNOR and XNORcc are identical to the XOR-Not and XOR-Not-cc
logical operations, respectively.





Exceptions 


None 





A.34 Memory Barrier

Opcode   op3       Operation
MEMBAR   10 1000   Memory Barrier

Format (3)

10    0       op3     0 1111  i=1   —      cmask  mmask
31 30 29   25 24   19 18   14 13    12   7 6    4 3     0

Assembly Language Syntax

membar   membar_mask



Description 





The memory barrier instruction, MEMBAR, has two complementary functions: to express
order constraints between memory references and to provide explicit control of memory-
reference completion. The membar_mask field in the suggested assembly language is the
concatenation of the cmask and mmask instruction fields.


MEMBAR introduces an order constraint between classes of memory references appearing
before the MEMBAR and memory references following it in a program. The particular classes
of memory references are specified by the mmask field. Memory references are classified as
loads (including load instructions LDSTUB(A), SWAP(A), CASA, and CASXA) and stores
(including store instructions LDSTUB(A), SWAP(A), CASA, CASXA, and FLUSH). The mmask
field specifies the classes of memory references subject to ordering, as described. MEMBAR
applies to all memory operations in all address spaces referenced by the issuing processor,
but it has no effect on memory references by other processors. When the cmask field is
nonzero, completion as well as order constraints are imposed, and the order imposed can be
more stringent than that specifiable by the mmask field alone.





A load has been performed when the value loaded has been transmitted from memory and 
cannot be modified by another processor. A store has been performed when the value stored 
has become visible, that is, when the previous value can no longer be read by any processor. 
In specifying the effect of MEMBAR, instructions are considered to be executed as if they 
were processed in a strictly sequential fashion, with each instruction completed before the 
next has begun. 





The mmask field is encoded in bits 3 through 0 of the instruction. TABLE A-9 specifies the
order constraint that each bit of mmask (selected when set to one) imposes on memory
references appearing before and after the MEMBAR. From zero to four mask bits may be
selected in the mmask field.

TABLE A-9 MEMBAR mmask Encodings

Mask Bit   Name         Description
mmask<3>   StoreStore   The effects of all stores appearing prior to the MEMBAR instruction
                        must be visible to all processors before the effect of any stores
                        following the MEMBAR; it is equivalent to the deprecated STBAR
                        instruction.
mmask<2>   LoadStore    All loads appearing prior to the MEMBAR instruction must have been
                        performed before the effects of any stores following the MEMBAR are
                        visible to any other processor.
mmask<1>   StoreLoad    The effects of all stores appearing prior to the MEMBAR instruction
                        must be visible to all processors before loads following the MEMBAR
                        may be performed.
mmask<0>   LoadLoad     All loads appearing prior to the MEMBAR instruction must have been
                        performed before any loads following the MEMBAR may be performed.

The cmask field is encoded in bits 6 through 4 of the instruction. Bits in the cmask field, 
described in TABLE A-10, specify additional constraints on the order of memory references 
and the processing of instructions. If cmask is zero, then MEMBAR enforces the partial 
ordering specified by the mmask field; if cmask is nonzero, then completion and partial 
order constraints are applied. 
TABLE A-10 MEMBAR cmask Encodings

Mask Bit   Function                  Name         Description
cmask<2>   Synchronization barrier   #Sync        All operations (including non-memory
                                                  reference operations) appearing prior to the
                                                  MEMBAR must have been performed and the
                                                  effects of any exceptions be visible before
                                                  any instruction after the MEMBAR may be
                                                  initiated.
cmask<1>   Memory issue barrier      #MemIssue    All memory reference operations appearing
                                                  prior to the MEMBAR must have been performed
                                                  before any memory operation after the MEMBAR
                                                  may be initiated.
cmask<0>   Lookaside barrier         #Lookaside   A store appearing prior to the MEMBAR must
                                                  complete before any load following the
                                                  MEMBAR referencing the same address can be
                                                  initiated.

















The encoding of MEMBAR is identical to that of the RDASR instruction, except that rs1 = 15,
rd = 0, and i = 1.
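
As a worked illustration of the field layout, the following C sketch assembles a MEMBAR instruction word from cmask and mmask. It uses op3 = 10 1000, rs1 = 15, rd = 0, and i = 1 as stated in this section, with cmask in bits 6:4 and mmask in bits 3:0; the value op = 10 in bits 31:30 is an assumption taken from the Format (3) layout used elsewhere in this appendix. The constant and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* mmask bits (TABLE A-9) and cmask bits (TABLE A-10); names are illustrative. */
enum {
    MEMBAR_LOADLOAD   = 1u << 0,
    MEMBAR_STORELOAD  = 1u << 1,
    MEMBAR_LOADSTORE  = 1u << 2,
    MEMBAR_STORESTORE = 1u << 3
};
enum {
    MEMBAR_LOOKASIDE = 1u << 0,
    MEMBAR_MEMISSUE  = 1u << 1,
    MEMBAR_SYNC      = 1u << 2
};

/* Build the 32-bit instruction word for "membar membar_mask". */
uint32_t membar_encode(unsigned cmask, unsigned mmask)
{
    assert(cmask <= 0x7u && mmask <= 0xFu);
    return ((uint32_t)0x2  << 30)   /* op  = 10 (assumed, Format (3)) */
         | ((uint32_t)0x00 << 25)   /* rd  = 0 */
         | ((uint32_t)0x28 << 19)   /* op3 = 10 1000 */
         | ((uint32_t)15   << 14)   /* rs1 = 15 */
         | ((uint32_t)1    << 13)   /* i   = 1 */
         | ((uint32_t)cmask << 4)   /* bits 6:4 */
         | (uint32_t)mmask;         /* bits 3:0 */
}
```

Under these assumptions, `membar_encode(0, MEMBAR_STORESTORE)` yields 0x8143E008 and `membar_encode(MEMBAR_SYNC, 0)` yields 0x8143E040.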


The coherence and atomicity of memory operations between processors and I/O DMA 
memory accesses is maintained for cacheable memory space. 








Compatibility Note — MEMBAR with mmask = 8₁₆ and cmask = 0₁₆
("membar #StoreStore") is identical in function to the SPARC-V8 STBAR instruction,
which is deprecated.





The information included in this section should not be used to decide when
MEMBARs should be added to software that needs to be compliant across all
UltraSPARC-based platforms. The operations of block load/block store (BLD/BST) on the
UltraSPARC IIIi processor are generally more ordered with respect to other operations,
compared to the UltraSPARC I processor and the UltraSPARC II processor. Code written and
found to "work" on the UltraSPARC IIIi processor may not work on the UltraSPARC I
processor and the UltraSPARC II processor if it does not follow the rules for BLD/BST
specified for those processors. Code that happens to work on the UltraSPARC I processor
and the UltraSPARC II processor may not work on the UltraSPARC IIIi processor if it did
not meet the coding guidelines specified for those processors. In no case is the coding
requirement for the UltraSPARC IIIi processor more restrictive than that for the
UltraSPARC I and the UltraSPARC II processors.


Software developers should not use the information in this section for determining the need
for MEMBARs but instead should rely on the SPARC-V9 MEMBAR rules. These
UltraSPARC IIIi processor rules are less restrictive than SPARC-V9, UltraSPARC I
processor, and the UltraSPARC II processor rules and are never more restrictive.








MEMBAR Rules 


The UltraSPARC IIIi hardware uses the following rules to guide the interlock
implementation.


1. A non-cacheable load or store with the side-effect bit on will always be blocked.


2. A cacheable or non-cacheable BLD will not be blocked.


3. VA<12:5> of a load (cacheable or non-cacheable) is compared with VA<12:5> of
all entries in the Store Queue. When a match is detected, the load (cacheable or
non-cacheable) is blocked.


4. A MEMBAR must be inserted if strong ordering is desired and rules 1 to 3 do
not apply.
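
The VA<12:5> comparison in rule 3 amounts to comparing an 8-bit field of two virtual addresses. The following C sketch illustrates that comparison only; the function names and the array standing in for the Store Queue are hypothetical and do not describe the actual hardware structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Extract bits 12:5 of a virtual address. */
unsigned va_12_5(uint64_t va)
{
    return (unsigned)((va >> 5) & 0xFFu);
}

/* Rule 3 as a predicate: a load is blocked when its VA<12:5> matches that of
   any pending store. pending_store_vas is an illustrative stand-in for the
   Store Queue contents. */
bool load_blocked_by_store_queue(uint64_t load_va,
                                 const uint64_t *pending_store_vas,
                                 size_t count)
{
    assert(pending_store_vas != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (va_12_5(pending_store_vas[i]) == va_12_5(load_va)) {
            return true;
        }
    }
    return false;
}
```

Because only bits 12:5 take part, two addresses that differ only above bit 12 or below bit 5 still count as a match in this model.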


TABLE A-11 and TABLE A-12 reflect the hardware interlocking mechanism implemented in the
UltraSPARC IIIi processor. The tables are read from Row to Column, the first memory
operation in program order being in Row followed by the memory operation found in
Column. The following two symbols are used as table entries:


- # — No intervening operation required because Fireplane-compliant systems
automatically order R before C.


- M — MEMBAR #Sync or MEMBAR #MemIssue or MEMBAR #StoreLoad required.


When VA<12:5> of a column operation does not match VA<12:5> of a row operation and
strong ordering is desired, the MEMBAR rules summarized in TABLE A-11 reflect the
UltraSPARC IIIi processor's hardware implementation.








TABLE A-11 MEMBAR Rules for Column VA<12:5> ≠ Row VA<12:5> While Desiring Strong Ordering

From Row Operation R:      To Column Operation C:
load                       # M M
load from internal ASI     # # #
store                      # M M
atomic                     # M M M
load_nc_e                  # M M M
store_nc_e                 # M M M
store_nc_ne                # M M M M
bload                      M M M M M
bstore                     M M M M M
bstore_commit              M # M # M M M M M M # M M
bload_nc                   M # M # M M M M M # M M
bstore_nc                  M M M








When VA<12:5> of a column operation matches VA<12:5> of a row operation, the MEMBAR 
rules summarized in TABLE A-12 reflect the UltraSPARC IIIi's hardware implementation. 


TABLE A-12 MEMBAR Rules for Column VA<12:5> = Row VA<12:5> While Desiring Strong Ordering

From Row Operation R:      To Column Operation C:
load                       #
load from internal ASI     #
store                      #
store to internal ASI      #
atomic                     #
load_nc_e                  #
store_nc_e                 #
load_nc_ne                 #
store_nc_ne                #
bload                      #
bstore                     #
bstore_commit              M # M # M M M M M M M # M M
bload_nc                   # # # # Sou # # # # # # # #
bstore_nc                  #





Special Rules for Quad LDD (ASI 24₁₆ and ASI 2C₁₆)


MEMBAR is only required before quad LDD if VA<12:5> of a preceding store to the same 
address space matches VA<12:5> of the quad LDD. 





Exceptions 


None


A.35 Move Floating-Point Register on Condition (FMOVcc)

For Integer Condition Codes

Opcode     op3       cond   Operation                                          icc/xcc Test
FMOVA      11 0101   1000   Move Always                                        1
FMOVN      11 0101   0000   Move Never                                         0
FMOVNE     11 0101   1001   Move if Not Equal                                  not Z
FMOVLE     11 0101   0010   Move if Less or Equal                              Z or (N xor V)
FMOVLEU    11 0101   0100   Move if Less or Equal Unsigned                     (C or Z)
FMOVCC     11 0101   1101   Move if Carry Clear (Greater or Equal, Unsigned)
FMOVCS     11 0101   0101   Move if Carry Set (Less than, Unsigned)
FMOVNEG    11 0101   0110   Move if Negative                                   N
FMOVVC     11 0101   1111   Move if Overflow Clear                             not V

For Floating-Point Condition Codes

Opcode     op3       cond   Operation                                  fcc Test
FMOVFA     11 0101          Move Always                                1
FMOVFN     11 0101          Move Never                                 0
FMOVFU     11 0101          Move if Unordered                          U
FMOVFG     11 0101          Move if Greater                            G
FMOVFUG    11 0101          Move if Unordered or Greater               G or U
FMOVFL     11 0101          Move if Less                               L
FMOVFUL    11 0101          Move if Unordered or Less                  L or U
FMOVFLG    11 0101          Move if Less or Greater                    L or G
FMOVFNE    11 0101          Move if Not Equal                          L or G or U
FMOVFE     11 0101          Move if Equal                              E
FMOVFUE    11 0101          Move if Unordered or Equal                 E or U
FMOVFGE    11 0101          Move if Greater or Equal                   E or G
FMOVFUGE   11 0101          Move if Unordered or Greater or Equal      E or G or U
FMOVFLE    11 0101          Move if Less or Equal                      E or L
FMOVFULE   11 0101          Move if Unordered or Less or Equal         E or L or U
FMOVFO     11 0101          Move if Ordered                            E or L or G

Format (4)

10    rd      op3     0    cond    opf_cc  opf_low  rs2
31 30 29   25 24   19 18   17   14 13   11 10     5 4    0

Encoding of the opf_cc Field

opf_cc   Condition Code
000      fcc0
001      fcc1
010      fcc2
011      fcc3
100      icc
101      Reserved
110      xcc
111      Reserved






































Encoding of opf Field (opf_cc || opf_low)

Instruction Variation         opf_cc   opf_low   opf
FMOVScc  %fccn, rs2, rd       0nn      00 0001   0 nn00 0001
FMOVDcc  %fccn, rs2, rd       0nn      00 0010   0 nn00 0010
FMOVQcc  %fccn, rs2, rd       0nn      00 0011   0 nn00 0011
FMOVScc  %icc, rs2, rd        100      00 0001   1 0000 0001
FMOVDcc  %icc, rs2, rd        100      00 0010   1 0000 0010
FMOVQcc  %icc, rs2, rd        100      00 0011   1 0000 0011
FMOVScc  %xcc, rs2, rd        110      00 0001   1 1000 0001
FMOVDcc  %xcc, rs2, rd        110      00 0010   1 1000 0010
FMOVQcc  %xcc, rs2, rd        110      00 0011   1 1000 0011

For Integer Condition Codes

Assembly Language Syntax

fmov{s,d,q}a     i_or_x_cc, freg_rs2, freg_rd
fmov{s,d,q}n     i_or_x_cc, freg_rs2, freg_rd
fmov{s,d,q}ne    i_or_x_cc, freg_rs2, freg_rd   (synonyms: fmov{s,d,q}nz)
fmov{s,d,q}e     i_or_x_cc, freg_rs2, freg_rd   (synonyms: fmov{s,d,q}z)
fmov{s,d,q}g     i_or_x_cc, freg_rs2, freg_rd
fmov{s,d,q}le    i_or_x_cc, freg_rs2, freg_rd
fmov{s,d,q}ge    i_or_x_cc, freg_rs2, freg_rd
fmov{s,d,q}l     i_or_x_cc, freg_rs2, freg_rd
fmov{s,d,q}gu    i_or_x_cc, freg_rs2, freg_rd
fmov{s,d,q}leu   i_or_x_cc, freg_rs2, freg_rd
fmov{s,d,q}cc    i_or_x_cc, freg_rs2, freg_rd   (synonyms: fmov{s,d,q}geu)
fmov{s,d,q}cs    i_or_x_cc, freg_rs2, freg_rd   (synonyms: fmov{s,d,q}lu)
fmov{s,d,q}pos   i_or_x_cc, freg_rs2, freg_rd
fmov{s,d,q}neg   i_or_x_cc, freg_rs2, freg_rd
fmov{s,d,q}vc    i_or_x_cc, freg_rs2, freg_rd
fmov{s,d,q}vs    i_or_x_cc, freg_rs2, freg_rd

Programming Note — To select the appropriate condition code, include %icc or %xcc
before the registers.
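
The choice of condition code and operand size determines the 9-bit opf field, which is the 3-bit opf_cc value followed by the 6-bit opf_low value. The following C sketch shows that concatenation; the function names are hypothetical, and the reserved check follows the opf_cc values 101 and 111 named under Exceptions for this instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* opf = opf_cc || opf_low: opf_cc occupies the upper 3 bits, opf_low the lower 6. */
uint32_t fmovcc_opf(unsigned opf_cc, unsigned opf_low)
{
    assert(opf_cc <= 0x7u && opf_low <= 0x3Fu);
    return ((uint32_t)opf_cc << 6) | (uint32_t)opf_low;
}

/* opf_cc values 101 and 111 do not select a condition code. */
bool fmovcc_opf_cc_reserved(unsigned opf_cc)
{
    return opf_cc == 0x5u || opf_cc == 0x7u;
}
```

For example, FMOVScc on %icc uses opf_cc = 100 and opf_low = 00 0001, giving opf = 1 0000 0001 (0x101), which agrees with the corresponding row of the opf encoding table.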





For Floating-Point Condition Codes 















































Assembly Language Syntax

fmov{s,d,q}a     %fccn, freg_rs2, freg_rd
fmov{s,d,q}n     %fccn, freg_rs2, freg_rd
fmov{s,d,q}u     %fccn, freg_rs2, freg_rd
fmov{s,d,q}g     %fccn, freg_rs2, freg_rd
fmov{s,d,q}ug    %fccn, freg_rs2, freg_rd
fmov{s,d,q}l     %fccn, freg_rs2, freg_rd
fmov{s,d,q}ul    %fccn, freg_rs2, freg_rd
fmov{s,d,q}lg    %fccn, freg_rs2, freg_rd
fmov{s,d,q}ne    %fccn, freg_rs2, freg_rd   (synonyms: fmov{s,d,q}nz)
fmov{s,d,q}e     %fccn, freg_rs2, freg_rd   (synonyms: fmov{s,d,q}z)
fmov{s,d,q}ue    %fccn, freg_rs2, freg_rd
fmov{s,d,q}ge    %fccn, freg_rs2, freg_rd
fmov{s,d,q}uge   %fccn, freg_rs2, freg_rd
fmov{s,d,q}le    %fccn, freg_rs2, freg_rd
fmov{s,d,q}ule   %fccn, freg_rs2, freg_rd
fmov{s,d,q}o     %fccn, freg_rs2, freg_rd

Description 


These instructions copy the floating-point register(s) specified by rs2 to the floating-point
register(s) specified by rd if the condition indicated by the cond field is satisfied by the
selected condition code. The condition code used is specified by the opf_cc field of the
instruction. If the condition is FALSE, then the destination register(s) are not changed.


These instructions do not modify any condition codes.


Programming Note — In general, branches cause the processor's performance to degrade.
Frequently, the MOVcc and FMOVcc instructions can be used to avoid branches. For
example, the following C language segment:

    double A, B, X;

    if (A > B) X = 1.03; else X = 0.0;


can be coded as

    ! assume A is in %f0; B is in %f2; %xx points to constant area
    ldd     [%xx+C_1.03],%f4      ! X = 1.03
    fcmpd   %fcc3,%f0,%f2         ! A > B
    fble,a  %fcc3,label
    ! following only executed if the branch is taken
    fsubd   %f4,%f4,%f4           ! X = 0.0
    label:...


This code takes four instructions including a branch.


With FMOVcc, this could be coded as

    ldd     [%xx+C_1.03],%f4      ! X = 1.03
    fsubd   %f4,%f4,%f6           ! X' = 0.0
    fcmpd   %fcc3,%f0,%f2         ! A > B
    fmovdle %fcc3,%f6,%f4         ! X = 0.0


This code also takes four instructions but requires no branches and may boost performance
significantly. Use MOVcc and FMOVcc instead of branches wherever these instructions would
improve performance.
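
The fcc Test column of the floating-point condition table can be read as a set of acceptable comparison outcomes. The following C sketch models that reading. It assumes the SPARC V9 convention that an fcc field holds 0 for equal (E), 1 for less (L), 2 for greater (G), and 3 for unordered (U), which is not restated in this section; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* One bit per fcc outcome; bit position equals the assumed fcc field value. */
enum {
    FCC_E = 1u << 0,   /* fcc = 0: equal     */
    FCC_L = 1u << 1,   /* fcc = 1: less      */
    FCC_G = 1u << 2,   /* fcc = 2: greater   */
    FCC_U = 1u << 3    /* fcc = 3: unordered */
};

/* A few rows of the fcc Test column expressed as outcome sets. */
enum {
    FCC_TEST_LE  = FCC_E | FCC_L,           /* FMOVFLE:  E or L       */
    FCC_TEST_LG  = FCC_L | FCC_G,           /* FMOVFLG:  L or G       */
    FCC_TEST_UGE = FCC_E | FCC_G | FCC_U,   /* FMOVFUGE: E or G or U  */
    FCC_TEST_O   = FCC_E | FCC_L | FCC_G    /* FMOVFO:   E or L or G  */
};

/* True when the current fcc value is one of the outcomes accepted by the test. */
bool fcc_test(unsigned test_set, unsigned fcc)
{
    assert(fcc <= 3u);
    return ((test_set >> fcc) & 1u) != 0;
}

/* Conditional move of a double: rd keeps its old value when the test fails. */
double fmovd_on(bool condition, double rs2_value, double rd_value)
{
    return condition ? rs2_value : rd_value;
}
```

In the programming note's example, A > B leaves fcc3 at "greater", so the LE test fails and X keeps the value 1.03 loaded earlier.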


Exceptions 


fp_disabled
fp_exception_other (ftt = unimplemented_FPop (opf_cc = 101₂ or 111₂, and quad forms))


A.36 Move Floating-Point Register on Integer Register Condition (FMOVr)

Opcode     Operation                                        Test
—          Reserved                                         —
FMOVRZ     Move if Register Zero                            r[rs1] = 0
FMOVRLEZ   Move if Register Less Than or Equal to Zero      r[rs1] ≤ 0
FMOVRLZ    Move if Register Less Than Zero                  r[rs1] < 0
—          Reserved                                         —
FMOVRNZ    Move if Register Not Zero                        r[rs1] ≠ 0
FMOVRGZ    Move if Register Greater Than Zero               r[rs1] > 0
FMOVRGEZ   Move if Register Greater Than or Equal to Zero   r[rs1] ≥ 0


Format (4)

10    rd      op3     rs1     0    rcond   opf_low  rs2
31 30 29   25 24   19 18   14 13   12   10 9      5 4    0


Encoding of opf_low Field

Instruction Variation          opf_low
FMOVSrcond  rs1, rs2, rd       0 0101
FMOVDrcond  rs1, rs2, rd       0 0110
FMOVQrcond  rs1, rs2, rd       0 0111

Assembly Language Syntax

fmovr{s,d,q}e     reg_rs1, freg_rs2, freg_rd   (synonym: fmovr{s,d,q}z)
fmovr{s,d,q}lez   reg_rs1, freg_rs2, freg_rd
fmovr{s,d,q}lz    reg_rs1, freg_rs2, freg_rd
fmovr{s,d,q}ne    reg_rs1, freg_rs2, freg_rd   (synonym: fmovr{s,d,q}nz)
fmovr{s,d,q}gz    reg_rs1, freg_rs2, freg_rd
fmovr{s,d,q}gez   reg_rs1, freg_rs2, freg_rd

Description 


If the contents of integer register r[rs1] satisfy the condition specified in the rcond field,
these instructions copy the contents of the floating-point register(s) specified by the rs2 field
to the floating-point register(s) specified by the rd field. If the contents of r[rs1] do not
satisfy the condition, the floating-point register(s) specified by the rd field are not modified.


These instructions treat the integer register contents as a signed integer value; they do not
modify any condition codes.





Implementation Note — The UltraSPARC IIIi processor does not implement this
instruction by tagging each register value. The UltraSPARC IIIi processor looks at the full
64-bit register to determine whether the value is negative or zero.
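
The register-condition test can be sketched in C as a comparison of the full 64-bit signed value against zero. The rcond encodings used below (001 through 011 and 101 through 111) are inferred from the row order of the opcode table together with the reserved values 000 and 100 named under Exceptions, so treat them as an assumption of this sketch; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the condition holds, 0 if it does not, and -1 for a reserved
   rcond value (000 or 100). The whole 64-bit register is examined as a
   signed integer. */
int rcond_holds(unsigned rcond, int64_t reg_value)
{
    assert(rcond <= 7u);
    switch (rcond) {
    case 1:  return reg_value == 0;   /* Z:   register zero             */
    case 2:  return reg_value <= 0;   /* LEZ: less than or equal zero   */
    case 3:  return reg_value <  0;   /* LZ:  less than zero            */
    case 5:  return reg_value != 0;   /* NZ:  register not zero         */
    case 6:  return reg_value >  0;   /* GZ:  greater than zero         */
    case 7:  return reg_value >= 0;   /* GEZ: greater than or equal     */
    default: return -1;               /* 000 and 100 are reserved       */
    }
}
```

A value such as 0x0000_0000_8000_0000 illustrates the full-width test: bit 31 is set, but the 64-bit value is positive, so the less-than-zero condition does not hold.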





Exceptions 


fp_disabled
fp_exception_other (unimplemented_FPop (rcond = 000₂ or 100₂, and quad forms))







A.37 Move Integer Register on Condition (MOVcc)

For Integer Condition Codes

Opcode   op3       cond   Operation                                          icc/xcc Test
MOVA     10 1100   1000   Move Always
MOVN     10 1100
MOVNE    10 1100          Move if Not Equal                                  not Z
MOVE     10 1100          Move if Equal                                      Z
MOVG     10 1100          Move if Greater                                    not (Z or (N xor V))
MOVLE    10 1100          Move if Less or Equal                              Z or (N xor V)
MOVGE    10 1100          Move if Greater or Equal                           not (N xor V)
MOVL     10 1100          Move if Less                                       N xor V
MOVGU    10 1100          Move if Greater Unsigned                           not (C or Z)
MOVLEU   10 1100          Move if Less or Equal Unsigned
MOVCC    10 1100          Move if Carry Clear (Greater or Equal, Unsigned)
MOVCS    10 1100          Move if Carry Set (Less than, Unsigned)
MOVPOS   10 1100          Move if Positive
MOVNEG   10 1100
MOVVC    10 1100   1111   Move if Overflow Clear                             not V
MOVVS    10 1100   0111   Move if Overflow Set                               V

For Floating-Point Condition Codes

Opcode    Operation                                  fcc Test
MOVFA     Move Always
MOVFN     Move Never
MOVFU     Move if Unordered                          U
MOVFG                                                G
MOVFUG    Move if Unordered or Greater
MOVFL     Move if Less
MOVFUL    Move if Unordered or Less
MOVFLG
MOVFNE    Move if Not Equal                          L or G or U
MOVFE     Move if Equal                              E
MOVFUE    Move if Unordered or Equal                 E or U
MOVFGE
MOVFUGE   Move if Unordered or Greater or Equal      E or G or U
MOVFLE    Move if Less or Equal                      E or L
MOVFULE   Move if Unordered or Less or Equal         E or L or U





























Format (4)

10    rd      op3     cc2  cond    i=0  cc1  cc0  —        rs2
10    rd      op3     cc2  cond    i=1  cc1  cc0  simm11
31 30 29   25 24   19 18   17   14 13   12   11   10     5 4    0

cc2 cc1 cc0   Condition Code
000           fcc0
001           fcc1
010           fcc2
011           fcc3
100           icc
101           Reserved
110           xcc
111           Reserved





For Integer Condition Codes 



































Assembly Language Syntax

mova     i_or_x_cc, reg_or_imm11, reg_rd
movne    i_or_x_cc, reg_or_imm11, reg_rd   (synonym: movnz)
move     i_or_x_cc, reg_or_imm11, reg_rd   (synonym: movz)
movg     i_or_x_cc, reg_or_imm11, reg_rd
movge    i_or_x_cc, reg_or_imm11, reg_rd
movl     i_or_x_cc, reg_or_imm11, reg_rd
movgu    i_or_x_cc, reg_or_imm11, reg_rd
movcc    i_or_x_cc, reg_or_imm11, reg_rd   (synonym: movgeu)
movcs    i_or_x_cc, reg_or_imm11, reg_rd   (synonym: movlu)
movpos   i_or_x_cc, reg_or_imm11, reg_rd
movvc    i_or_x_cc, reg_or_imm11, reg_rd
movvs    i_or_x_cc, reg_or_imm11, reg_rd

Programming Note — To select the appropriate condition code, include %icc or %xcc
before the register or immediate field.





For Floating-Point Condition Codes 





Assembly Language Syntax

mova     %fccn, reg_or_imm11, reg_rd
movn     %fccn, reg_or_imm11, reg_rd
movu     %fccn, reg_or_imm11, reg_rd
movg     %fccn, reg_or_imm11, reg_rd
movug    %fccn, reg_or_imm11, reg_rd
movl     %fccn, reg_or_imm11, reg_rd
movlg    %fccn, reg_or_imm11, reg_rd
movne    %fccn, reg_or_imm11, reg_rd   (synonym: movnz)
move     %fccn, reg_or_imm11, reg_rd   (synonym: movz)
movge    %fccn, reg_or_imm11, reg_rd
movuge   %fccn, reg_or_imm11, reg_rd
movle    %fccn, reg_or_imm11, reg_rd


Ex RSEN reg or imm], "8rd dE! 











Programming Note — To select the appropriate condition code, include %fcc0, %fcc1,
%fcc2, or %fcc3 before the register or immediate field.


Description 





These instructions test to see if cond is TRUE for the selected condition codes. If so, they
copy the value in r[rs2] if the i field = 0, or "sign_ext(simm11)" if i = 1, into
r[rd]. The condition code used is specified by the cc2, cc1, and cc0 fields of the
instruction. If the condition is FALSE, then r[rd] is not changed.


These instructions copy an integer register to another integer register if the condition is
TRUE. The condition code that is used to determine whether the move will occur can be
either integer condition code (icc or xcc) or any floating-point condition code (fcc0,
fcc1, fcc2, or fcc3).


These instructions do not modify any condition codes.





Programming Note — In general, branches cause the processor performance to degrade.
Frequently, the MOVcc and FMOVcc instructions can be used to avoid branches. For
example, consider the C language if-then-else statement:

    if (A > B) X = 1; else X = 0;


can be coded as

    cmp     %i0,%i2
    bg,a    %xcc,label
    or      %g0,1,%i3             ! X = 1
    or      %g0,0,%i3             ! X = 0
    label:...


This takes four instructions including a branch. With MOVcc, this could be coded as

    cmp     %i0,%i2
    or      %g0,1,%i3             ! assume X = 1
    movle   %xcc,0,%i3            ! overwrite with X = 0


This approach takes only three instructions and no branches and may boost performance
significantly. Use MOVcc and FMOVcc instead of branches wherever these instructions would
increase performance.
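
The three-instruction MOVcc sequence can be traced with a small C model that evaluates the icc/xcc tests from the opcode table. This is a sketch under assumptions: the flag computation in cmp64 follows the usual subtract-and-set-flags convention (C set on unsigned borrow, V set on signed overflow), which this section does not define, and every name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, v, c; } cc_flags;

/* Assumed model of "cmp a,b": xcc flags of the 64-bit subtraction a - b. */
cc_flags cmp64(uint64_t a, uint64_t b)
{
    uint64_t diff = a - b;
    cc_flags f;
    f.n = (diff >> 63) & 1;
    f.z = (diff == 0);
    f.v = (((a ^ b) & (a ^ diff)) >> 63) & 1;   /* signed overflow */
    f.c = (a < b);                              /* unsigned borrow */
    return f;
}

/* Two of the tests from the icc/xcc Test column. */
bool test_le(cc_flags f) { return f.z || (f.n != f.v); }   /* Z or (N xor V)       */
bool test_g(cc_flags f)  { return !test_le(f); }           /* not (Z or (N xor V)) */

/* MOVcc semantics: rd receives src only when the condition is TRUE. */
uint64_t movcc(bool condition, uint64_t src, uint64_t rd)
{
    assert(condition == true || condition == false);
    return condition ? src : rd;
}

/* The branch-free sequence from the programming note: X = (A > B) ? 1 : 0. */
uint64_t select_greater(int64_t a, int64_t b)
{
    cc_flags f = cmp64((uint64_t)a, (uint64_t)b);   /* cmp   %i0,%i2    */
    uint64_t x = 1;                                 /* or    %g0,1,%i3  */
    x = movcc(test_le(f), 0, x);                    /* movle %xcc,0,%i3 */
    return x;
}
```

Calling `select_greater(5, 3)` follows the path where the LE test fails, so X keeps the value 1 set by the preceding `or`.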


Exceptions 





illegal_instruction (cc2 cc1 cc0 = 101₂ or 111₂)
fp_disabled (cc2 cc1 cc0 = 000₂, 001₂, 010₂, or 011₂ and the FPU is disabled)


A.38 Move Integer Register on Register Condition (MOVr)

Opcode    op3       Operation
—                   Reserved
MOVRZ     10 1111   Move if Register Zero
MOVRLEZ   10 1111   Move if Register Less Than or Equal to Zero
MOVRLZ    10 1111   Move if Register Less Than Zero
—                   Reserved
MOVRNZ    10 1111   Move if Register Not Zero
MOVRGZ    10 1111   Move if Register Greater Than Zero
MOVRGEZ   10 1111   Move if Register Greater Than or Equal to Zero


Format (3)

10    rd      op3     rs1     i=0   rcond   —       rs2
10    rd      op3     rs1     i=1   rcond   simm10
31 30 29   25 24   19 18   14 13    12   10 9     5 4    0














Assembly Language Syntax

movrz     reg_rs1, reg_or_imm10, reg_rd   (synonym: movre)
movrlez   reg_rs1, reg_or_imm10, reg_rd
movrlz    reg_rs1, reg_or_imm10, reg_rd
movrnz    reg_rs1, reg_or_imm10, reg_rd   (synonym: movrne)
movrgz    reg_rs1, reg_or_imm10, reg_rd
movrgez   reg_rs1, reg_or_imm10, reg_rd

Description


If the contents of integer register r[rs1] satisfy the condition specified in the rcond field,
these instructions copy r[rs2] (if i = 0) or sign_ext(simm10) (if i = 1) into r[rd].
If the contents of r[rs1] do not satisfy the condition, then r[rd] is not modified. These
instructions treat the register contents as a signed integer value; they do not modify any
condition codes.





Implementation Note — The UltraSPARC IIIi processor does not implement this
instruction by tagging each register value. The UltraSPARC IIIi processor looks at the full
64-bit register to determine whether the value is negative or zero.





Exceptions 


illegal_instruction (rcond = 000₂ or 100₂)





A.39 Multiply and Divide (64-bit)

Opcode   op3       Operation
MULX     00 1001   Multiply (signed or unsigned)
SDIVX              Signed Divide
UDIVX              Unsigned Divide

Format (3)

10    rd      op3     rs1     i=0   —        rs2
10    rd      op3     rs1     i=1   simm13
31 30 29   25 24   19 18   14 13    12     5 4     0

Assembly Language Syntax

mulx     reg_rs1, reg_or_imm, reg_rd
sdivx    reg_rs1, reg_or_imm, reg_rd
udivx    reg_rs1, reg_or_imm, reg_rd
Description 


MULX computes "r[rs1] × r[rs2]" if i = 0 or "r[rs1] × sign_ext(simm13)" if
i = 1, and writes the 64-bit product into r[rd]. MULX can be used to calculate the 64-bit
product for signed or unsigned operands (the product is the same).


SDIVX and UDIVX compute "r[rs1] ÷ r[rs2]" if i = 0 or
"r[rs1] ÷ sign_ext(simm13)" if i = 1, and write the 64-bit result into r[rd].
SDIVX operates on the operands as signed integers and produces a corresponding signed
result. UDIVX operates on the operands as unsigned integers and produces a corresponding
unsigned result.


For SDIVX, if the largest negative number is divided by −1, the result should be the largest
negative number. That is:


8000 0000 0000 0000₁₆ ÷ FFFF FFFF FFFF FFFF₁₆ = 8000 0000 0000 0000₁₆.


These instructions do not modify any condition codes. 


Exceptions 


division_by_zero





A.40 No Operation


Opcode op op2 Operation
NOP 0 0000 100 No Operation

Format (2)


00 op op2 0000000000000000000000 
31 30 29 25 24 22 21 0 


Assembly Language Syntax 


nop 





Description 


The NOP instruction changes no program-visible state (except that of the PC and nPC). 





NOP is a special case of the SETHI instruction, with imm22 = 0 and rd = 0.


Exceptions 


None 





A.41 Partial Store (VIS I)

Opcode   imm_asi        ASI Value   Operation
                                    Eight 8-bit conditional stores to primary address space
                                    Eight 8-bit conditional stores to secondary address space
                                    Eight 8-bit conditional stores to primary address space, little-endian
                                    Eight 8-bit conditional stores to secondary address space, little-endian
                                    Four 16-bit conditional stores to primary address space
                                    Four 16-bit conditional stores to secondary address space
                                    Four 16-bit conditional stores to primary address space, little-endian
                                    Four 16-bit conditional stores to secondary address space, little-endian
                                    Two 32-bit conditional stores to primary address space
                                    Two 32-bit conditional stores to secondary address space
STDFA    ASI_PST32_PL   CC₁₆        Two 32-bit conditional stores to primary address space, little-endian
STDFA    ASI_PST32_SL   CD₁₆        Two 32-bit conditional stores to secondary address space, little-endian

Format (3)

31 30 29   25 24   19 18   14 13    5 4     0





Assembly Language Syntax (1)

stda   freg_rd, reg_rs2, [reg_rs1] imm_asi

1. The original assembly language syntax for a partial store instruction
("stda freg_rd, [reg_rs1] reg_rs2, imm_asi") has been deprecated because of
inconsistency with the rest of the SPARC assembly language. Over time,
assemblers will support the new syntax for this instruction. In the meantime,
some assemblers may recognize only the original syntax.


Description 


The partial store instructions are selected by one of the partial store ASIs with the STDFA 
instruction. 


Two 32-bit, four 16-bit, or eight 8-bit values from the 64-bit floating-point register specified
by rd are conditionally stored at the address specified by r[rs1], using the mask specified
in r[rs2]. The value in r[rs2] has the same format as the result specified by the pixel
compare instructions (see Section A.44, "Pixel Compare (VIS I)"). The most significant bit
of the mask (not the entire register) corresponds to the most significant part of the floating-point
register specified by rd. The data is stored in little-endian form in memory if the ASI
name has an "L" suffix; otherwise, it is stored in big-endian format.
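The masking rule above can be sketched in Python for the eight-lane, big-endian case. This is an illustrative model only: the helper name `partial_store8` and the use of a `bytearray` as memory are assumptions, and the 8-byte alignment check is inferred from the mem_address_not_aligned exception listed for this instruction.

```python
def partial_store8(memory: bytearray, addr: int, data: int, mask: int) -> None:
    """Model an eight-lane, big-endian partial store (hypothetical helper)."""
    # Assumption: the 64-bit store must be 8-byte aligned.
    assert addr % 8 == 0, "address must be 8-byte aligned"
    for lane in range(8):
        # Mask bit 7 guards the most significant byte, which lands at addr + 0.
        if (mask >> (7 - lane)) & 1:
            memory[addr + lane] = (data >> (8 * (7 - lane))) & 0xFF


mem = bytearray(8)
partial_store8(mem, 0, 0x1122334455667788, 0b10000001)
print(mem.hex())  # 1100000000000088
```

Only the two bytes whose mask bits are set are written; the other six memory bytes keep their previous contents.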


A partial store instruction can cause a virtual (or physical) watchpoint exception when the 
following conditions are met: 


- The virtual (physical) address in r[rs1] matches the address in the VA (PA) Data
Watchpoint Register.


- The byte store mask in r[rs2] indicates that a byte is to be stored.


- The Virtual (Physical) Data Watchpoint Mask in DCUCR indicates that one or more of the
bytes to be stored at the watched address is being watched.


Watchpoint exceptions on partial store instructions behave as if every partial store always
stores all 8 bytes. The DCUCR Data Watchpoint masks are only checked for a nonzero value
(watchpoint enabled). The byte store mask (r[rs2]) in the partial store instruction is
ignored, and a watchpoint exception can occur even if the mask is zero (that is, no store will
take place).


ASIs C0₁₆-C5₁₆ and C8₁₆-CD₁₆ are only used for partial store operations. In particular, they
should not be used with the LDDFA instruction.





Note — If the byte ordering is little-endian, the byte enables generated by this instruction are 
swapped with respect to big-endian. 





Exceptions 


fp_disabled
illegal_instruction (when i = 1; no immediate mode is supported)
PA_watchpoint
VA_watchpoint
mem_address_not_aligned
data_access_exception
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection








A.42 Partitioned Add/Subtract Instructions (VIS I)





Opcode     opf          Operation
FPADD16    0 0101 0000  Four 16-bit Add
FPADD16S   0 0101 0001  Two 16-bit Add
FPADD32    0 0101 0010  Two 32-bit Add
FPADD32S   0 0101 0011  One 32-bit Add
FPSUB16    0 0101 0100  Four 16-bit Subtract
FPSUB16S   0 0101 0101  Two 16-bit Subtract
FPSUB32    0 0101 0110  Two 32-bit Subtract
FPSUB32S   0 0101 0111  One 32-bit Subtract








Format (3) 


31 30 29 25 24 19 18 14 13 5 4 0





























Assembly Language Syntax 
fpadd16   freg_rs1, freg_rs2, freg_rd
fpadd16s  freg_rs1, freg_rs2, freg_rd
fpadd32   freg_rs1, freg_rs2, freg_rd
fpadd32s  freg_rs1, freg_rs2, freg_rd
fpsub16   freg_rs1, freg_rs2, freg_rd
fpsub16s  freg_rs1, freg_rs2, freg_rd
fpsub32   freg_rs1, freg_rs2, freg_rd
fpsub32s  freg_rs1, freg_rs2, freg_rd

Description 


The standard versions of these instructions perform four 16-bit or two 32-bit partitioned adds 
or subtracts between the corresponding fixed-point values contained in the source operands 
(the 64-bit floating-point registers specified by rs1 and rs2). For subtraction, the second 
operand (rs2) is subtracted from the first operand (rs1). The result is placed in the 64-bit 
destination register specified by rd. 


The single-precision versions of these instructions (FPADD16S, FPSUB16S, FPADD32S,
FPSUB32S) perform two 16-bit or one 32-bit partitioned add(s) or subtract(s); only the low
32 bits of the destination register are affected.
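A partitioned add or subtract can be modeled in Python as independent per-lane arithmetic. This is a sketch rather than a definitive model: the function names are hypothetical, and because the text above does not state the overflow behavior, the sketch assumes each 16-bit lane wraps modulo 2^16 with no carry into its neighbor.

```python
def fpadd16(a: int, b: int) -> int:
    """Add four 16-bit lanes of two 64-bit values (assumed wraparound)."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        x = (a >> shift) & 0xFFFF
        y = (b >> shift) & 0xFFFF
        # Masking the sum keeps a carry from crossing into the next lane.
        result |= ((x + y) & 0xFFFF) << shift
    return result


def fpsub16(a: int, b: int) -> int:
    """Subtract the lanes of b (rs2) from the lanes of a (rs1)."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        x = (a >> shift) & 0xFFFF
        y = (b >> shift) & 0xFFFF
        result |= ((x - y) & 0xFFFF) << shift
    return result


print(hex(fpadd16(0x0001_FFFF_7FFF_0010, 0x0001_0001_0001_0020)))
# 0x2000080000030
```

In the example, the lane holding FFFF plus 0001 wraps to 0000 without disturbing the lane above it.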


Note — For good performance, the result of a single FPADD should not be used as part of a 
source operand of a 64-bit graphics instruction in the next instruction group. Similarly, the 
result of a standard FPADD should not be used as a 32-bit graphics instruction source 
operand in the next three instruction groups. 


Exceptions


fp_disabled





A.43 Partitioned Multiply Instructions (VIS I) 





Opcode        opf          Operation
FMUL8x16      0 0011 0001  8-bit x 16-bit Partitioned Product
FMUL8x16AU    0 0011 0011  8-bit x 16-bit Upper α Partitioned Product
FMUL8x16AL    0 0011 0101  8-bit x 16-bit Lower α Partitioned Product
FMUL8SUx16    0 0011 0110  Upper 8-bit x 16-bit Partitioned Product
FMUL8ULx16    0 0011 0111  Lower Unsigned 8-bit x 16-bit Partitioned Product
FMULD8SUx16   0 0011 1000  Upper 8-bit x 16-bit Partitioned Product
FMULD8ULx16   0 0011 1001  Lower Unsigned 8-bit x 16-bit Partitioned Product








Format (3) 


31 30 29 25 24 19 18 14 13 5 4 0 


























Assembly Language Syntax
fmul8x16     freg_rs1, freg_rs2, freg_rd
fmul8x16au   freg_rs1, freg_rs2, freg_rd
fmul8x16al   freg_rs1, freg_rs2, freg_rd
fmul8sux16   freg_rs1, freg_rs2, freg_rd
fmul8ulx16   freg_rs1, freg_rs2, freg_rd
fmuld8sux16  freg_rs1, freg_rs2, freg_rd
fmuld8ulx16  freg_rs1, freg_rs2, freg_rd


Description


Note — For good performance, the result of a partitioned multiply should not be used as a
source operand of a 32-bit graphics instruction in the next three instruction groups.


Programming Note — When software emulates an 8-bit unsigned by 16-bit signed
multiply, the unsigned value must be zero-extended and the 16-bit value sign-extended before
the multiplication.


The following sections describe the versions of partitioned multiplies.


Exceptions 


fp_disabled


A.43.1 FMUL8x16 Instruction


FMUL8x16 multiplies each unsigned 8-bit value (that is, a pixel) in f[rs1] by the
corresponding (signed) 16-bit fixed-point integer in the 64-bit floating-point register specified
by rs2; it rounds the 24-bit product (assuming binary point between bits 7 and 8) and stores
the upper 16 bits of the result into the corresponding 16-bit field in the 64-bit floating-point
destination register specified by rd. FIGURE A-5 illustrates the operation.


Note — This instruction treats the pixel values as fixed-point with the binary point to the left 
of the most significant bit. Typically, this operation is used with filter coefficients as the 
fixed-point rs2 value and image data as the rs1 pixel value. Appropriate scaling of the 
coefficient allows various fixed-point scaling to be realized. 
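The FMUL8x16 data path can be sketched in Python as follows. The function names are hypothetical, and the rounding step is an assumption: the text says only that the 24-bit product is rounded at the binary point between bits 7 and 8, so the sketch adds one half (0x80) and discards the low 8 bits.

```python
def to_signed16(value: int) -> int:
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def fmul8x16(rs1: int, rs2: int) -> int:
    """rs1: four unsigned 8-bit pixels (32 bits); rs2: four signed 16-bit values."""
    result = 0
    for lane in range(4):
        pixel = (rs1 >> (8 * lane)) & 0xFF          # zero-extended
        coeff = to_signed16((rs2 >> (16 * lane)) & 0xFFFF)  # sign-extended
        product = pixel * coeff                      # fits in 24 signed bits
        rounded = (product + 0x80) >> 8              # assumed round-half-up
        result |= (rounded & 0xFFFF) << (16 * lane)
    return result


# A pixel of 0x80 times a coefficient of 0x0100 yields 0x0080 in each lane.
print(hex(fmul8x16(0x80808080, 0x0100_0100_0100_0100)))
# 0x80008000800080
```

The zero-extension of the pixel and sign-extension of the coefficient follow the programming note on software emulation earlier in this section.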


[Figure: each 8-bit field of rs1 is multiplied by the corresponding 16-bit field of rs2; the
most significant 16 bits of each product are written to rd<63:48>, rd<47:32>, rd<31:16>, and
rd<15:0>.]


FIGURE A-5 FMUL8x16 Operation


A.43.2 FMUL8x16AU Instruction


FMUL8x16AU is the same as FMUL8x16, except that one 16-bit fixed-point value is used for
all four multiplies. This value is the most significant 16 bits of the 32-bit register f[rs2],
which is typically a proportional value. FIGURE A-6 illustrates the operation.





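[Figure: each 8-bit field of rs1 is multiplied by the upper 16 bits of rs2; the results are
written to rd<63:48>, rd<47:32>, rd<31:16>, and rd<15:0>.]


FIGURE A-6 FMUL8x16AU Operation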


A.43.3 FMUL8x16AL Instruction 


FMUL8x16AL is the same as FMUL8x16AU, except that the least significant 16 bits of the
32-bit register f[rs2] are used as the proportional value. FIGURE A-7 illustrates the
operation.


[Figure: each 8-bit field of rs1 (bits 31:24, 23:16, 15:8, 7:0) is multiplied by the lower 16
bits of rs2; the results are written to rd<63:48>, rd<47:32>, rd<31:16>, and rd<15:0>.]


FIGURE A-7 FMUL8x16AL Operation


A.43.4 FMUL8SUx16 Instruction


FMUL8SUx16 multiplies the upper 8 bits of each 16-bit signed value in the 64-bit floating-point
register specified by rs1 by the corresponding signed 16-bit fixed-point integer in the
64-bit floating-point register specified by rs2. It rounds the 24-bit product
toward the nearest representable value and then stores the upper 16 bits of the result into the
corresponding 16-bit field of the 64-bit floating-point destination register specified by rd. If
the product is exactly halfway between two integers, the result is rounded toward positive
infinity. FIGURE A-8 illustrates the operation.


[Figure: the upper byte of each 16-bit field of rs1 (bits 63:56, 47:40, 31:24, 15:8) is
multiplied by the corresponding 16-bit field of rs2.]


FIGURE A-8 FMUL8SUx16 Operation


A.43.5 FMUL8ULx16 Instruction


FMUL8ULx16 multiplies the unsigned lower 8 bits of each 16-bit value in the 64-bit floating- 
point register specified by rs1 by the corresponding fixed-point signed integer in the 64-bit 
floating-point register specified by rs2. Each 24-bit product is sign-extended to 32 bits. The 
upper 16 bits of the sign-extended value are rounded to nearest representable value and then 
stored in the corresponding 16-bit field of the 64-bit floating-point destination register 
specified by rd. If the result is exactly halfway between two integers, the result is rounded 
toward positive infinity. FIGURE A-9 illustrates the operation. CODE EXAMPLE A-5 shows an 
example. 


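[Figure: the lower byte of each 16-bit field of rs1 (bits 55:48, 39:32, 23:16, 7:0) is
multiplied by the corresponding 16-bit field of rs2; each product is sign-extended and its 8
most significant bits are written to the corresponding 16-bit field of rd.]


FIGURE A-9 FMUL8ULx16 Operation


CODE EXAMPLE A-5 FMUL8ULx16 Operation


fmul8sux16  %f0, %f1, %f2    ! upper-byte partial products
fmul8ulx16  %f0, %f1, %f3    ! lower-byte partial products
fpadd16     %f2, %f3, %f4    ! sum of the partial products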
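The three-instruction sequence in CODE EXAMPLE A-5 combines the upper-byte and lower-byte partial products into an approximation of the upper 16 bits of a signed 16-bit by 16-bit product. The Python sketch below models one 16-bit lane; the helper names are hypothetical, and the rounding (add one half, then shift) is an interpretation of the "rounded to nearest, halfway cases toward positive infinity" wording above. Because each partial product is rounded separately, the sum can differ slightly from the exactly computed upper half in some cases.

```python
def s16(value: int) -> int:
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def mul8sux16_lane(a: int, b: int) -> int:
    hi = s16(a) >> 8                 # signed upper 8 bits of a
    product = hi * s16(b)            # 24-bit signed product
    return ((product + 0x80) >> 8) & 0xFFFF


def mul8ulx16_lane(a: int, b: int) -> int:
    lo = a & 0xFF                    # unsigned lower 8 bits of a
    product = lo * s16(b)            # sign-extended product
    return ((product + 0x8000) >> 16) & 0xFFFF


def mul16_high_lane(a: int, b: int) -> int:
    """Model fmul8sux16 + fmul8ulx16 + fpadd16 for a single lane."""
    return (mul8sux16_lane(a, b) + mul8ulx16_lane(a, b)) & 0xFFFF


print(hex(mul16_high_lane(0x1234, 0x4000)))  # 0x48d
```

For 0x1234 times 0x4000 the exact 32-bit product is 0x048D0000, so the combined result 0x048D matches its upper 16 bits.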











A.43.6 FMULD8SUx16 Instruction


FMULD8SUx16 multiplies the upper 8 bits of each 16-bit signed value in f[rs1] by the
corresponding signed 16-bit fixed-point integer in f[rs2]. Each 24-bit product is
shifted left by 8 bits to make up a 32-bit result, which is then stored in the 64-bit floating-point
register specified by rd. FIGURE A-10 illustrates the operation.


[Figure: the upper byte of each 16-bit field of rs1 is multiplied by the corresponding 16-bit
field of rs2 (bits 31:16 and 15:0); each 24-bit product occupies rd<63:40> or rd<31:8>, with
rd<39:32> and rd<7:0> filled by the left shift.]


FIGURE A-10 FMULD8SUx16 Operation


A.43.7 FMULD8ULx16 Instruction


FMULD8ULx16 multiplies the unsigned lower 8 bits of each 16-bit value in f[rs1] by the
corresponding fixed-point signed integer in f[rs2]. Each 24-bit product is sign-extended to
32 bits and stored in the 64-bit floating-point register specified by rd. FIGURE A-11 illustrates
the operation; CODE EXAMPLE A-6 exemplifies the operation.





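[Figure: the lower byte of each 16-bit field of rs1 is multiplied by the corresponding 16-bit
field of rs2; each product is sign-extended and written to rd<63:32> or rd<31:0>.]


FIGURE A-11 FMULD8ULx16 Operation


CODE EXAMPLE A-6 FMULD8ULx16 Operation


fmuld8sux16  %f0, %f1, %f2    ! upper-byte partial products, 32-bit results
fmuld8ulx16  %f0, %f1, %f3    ! lower-byte partial products, 32-bit results
fpadd32      %f2, %f3, %f4    ! sum of the partial products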











A.44 Pixel Compare (VIS I)









































Opcode     opf          Operation
FCMPGT16   0 0010 1000  Four 16-bit Compares; set rd if src1 > src2
FCMPGT32   0 0010 1100  Two 32-bit Compares; set rd if src1 > src2
FCMPLE16   0 0010 0000  Four 16-bit Compares; set rd if src1 ≤ src2
FCMPLE32   0 0010 0100  Two 32-bit Compares; set rd if src1 ≤ src2
FCMPNE16   0 0010 0010  Four 16-bit Compares; set rd if src1 ≠ src2
FCMPNE32   0 0010 0110  Two 32-bit Compares; set rd if src1 ≠ src2
FCMPEQ16   0 0010 1010  Four 16-bit Compares; set rd if src1 = src2
FCMPEQ32   0 0010 1110  Two 32-bit Compares; set rd if src1 = src2
Format (3) 


31 30 29 25 24 19 18 14 13 5 4 0



































Assembly Language Syntax 
fcmpgt16  freg_rs1, freg_rs2, reg_rd
fcmpgt32  freg_rs1, freg_rs2, reg_rd
fcmple16  freg_rs1, freg_rs2, reg_rd
fcmple32  freg_rs1, freg_rs2, reg_rd
fcmpne16  freg_rs1, freg_rs2, reg_rd
fcmpne32  freg_rs1, freg_rs2, reg_rd
fcmpeq16  freg_rs1, freg_rs2, reg_rd
fcmpeq32  freg_rs1, freg_rs2, reg_rd

Description 


Either four 16-bit or two 32-bit fixed-point values in the 64-bit floating-point source registers
specified by rs1 and rs2 are compared. The 4-bit or 2-bit results are stored in the least
significant bits in the integer destination register r[rd]. Signed comparisons are used. Bit 0
of r[rd] corresponds to the least significant 16-bit or 32-bit comparison.


For FCMPGT, each bit in the result is set if the corresponding value in the first source operand
is greater than the value in the second source operand. Less-than comparisons are made by
swapping the operands.


For FCMPLE, each bit in the result is set if the corresponding value in the first source
operand is less than or equal to the value in the second source operand. Greater-than-or-equal
comparisons are made by swapping the operands.


For FCMPEQ, each bit in the result is set if the corresponding value in the first source
operand is equal to the value in the second source operand.


For FCMPNE, each bit in the result is set if the corresponding value in the first source
operand is not equal to the value in the second source operand.
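The comparison rules above can be sketched in Python with one generic helper. The names `fcmp`, `fcmpgt16`, and `fcmple16` are illustrative stand-ins for the instructions, not an existing API; the helper returns the mask that would be written to the low bits of r[rd].

```python
import operator


def to_signed(value: int, bits: int) -> int:
    """Interpret a bit pattern of the given width as two's complement."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def fcmp(a: int, b: int, width: int, compare) -> int:
    """Compare 64 // width signed lanes; bit 0 is the least significant lane."""
    lane_mask = (1 << width) - 1
    result = 0
    for lane in range(64 // width):
        x = to_signed((a >> (width * lane)) & lane_mask, width)
        y = to_signed((b >> (width * lane)) & lane_mask, width)
        if compare(x, y):
            result |= 1 << lane
    return result


def fcmpgt16(a: int, b: int) -> int:
    return fcmp(a, b, 16, operator.gt)


def fcmple16(a: int, b: int) -> int:
    return fcmp(a, b, 16, operator.le)


a = 0x0005_FFFF_0003_8000
b = 0x0004_0001_0003_7FFF
print(bin(fcmpgt16(a, b)))  # 0b1000
print(bin(fcmple16(a, b)))  # 0b111
```

Because the comparisons are signed, the lane holding 8000 (the most negative value) compares as less than 7FFF, and FFFF (minus one) compares as less than 0001.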


Exceptions 


fp_disabled


A.45 Pixel Component Distance (PDIST) (VIS I)


Opcode  opf          Operation
PDIST   0 0011 1110  Distance between eight 8-bit components


Format (3) 


31 30 29 25 24 19 18 14 13 5 4 0 





Assembly Language Syntax 


pdist freg_rs1, freg_rs2, freg_rd


Description 











Eight unsigned 8-bit values are contained in the 64-bit floating-point source registers 
specified by rs1 and rs2. The corresponding 8-bit values in the source registers are 
subtracted (that is, the second source operand from the first source operand). The sum of the 
absolute value of each difference is added to the integer in the 64-bit floating-point 
destination register specified by rd. The result is stored in the destination register. Typically, 
this instruction is used for motion estimation in video compression algorithms. 
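As a hedged illustration of the accumulation described above, the following Python sketch computes the sum of absolute differences and adds it to the previous contents of rd. The function name `pdist` and the explicit 64-bit wrap of the accumulator are assumptions made for the model.

```python
def pdist(rs1: int, rs2: int, rd: int) -> int:
    """Return rd plus the sum of absolute differences of eight byte lanes."""
    total = rd
    for lane in range(8):
        x = (rs1 >> (8 * lane)) & 0xFF   # unsigned 8-bit component
        y = (rs2 >> (8 * lane)) & 0xFF
        total += abs(x - y)
    return total & 0xFFFF_FFFF_FFFF_FFFF  # assumed: accumulator is 64 bits wide


# Accumulate a distance of 32 on top of a running total of 10.
print(pdist(0x0102030405060708, 0x0807060504030201, 10))  # 42
```

Calling the function repeatedly with its own result as `rd` mirrors how a motion-estimation loop would accumulate a block difference.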


Note — For good performance, the rd operand of PDIST should not reference the result of 
a non-PDIST instruction in the five previously executed instruction groups. 


Exceptions 


fp_disabled


A.46 Pixel Formatting (VIS I)


























Opcode     opf          Operation
FPACK16    0 0011 1011  Four 16-bit packs into 8 unsigned bits
FPACK32    0 0011 1010  Two 32-bit packs into 8 unsigned bits
FPACKFIX   0 0011 1101  Two 32-bit packs into 16 signed bits
FEXPAND    0 0100 1101  Four 16-bit expands
FPMERGE    0 0100 1011  Two 32-bit merges

Format (3) 


31 30 29 25 24 19 18 14 13 5 4 0 





Assembly Language Syntax 
fpack16   freg_rs2, freg_rd
fpack32   freg_rs1, freg_rs2, freg_rd
fpackfix  freg_rs2, freg_rd
fexpand   freg_rs2, freg_rd
fpmerge   freg_rs1, freg_rs2, freg_rd














Description 


The FPACK instructions convert multiple values in a source register to a lower-precision
fixed or pixel format and store the resulting values in the destination register. Input values
are clipped to the dynamic range of the output format. Packing applies a scale factor from
GSR.scale to allow flexible positioning of the binary point.


Programming Note — For good performance, the result of an FPACK (including
FPACK32) should not be used as part of a 64-bit graphics instruction source operand in the
next three instruction groups.


FEXPAND performs the inverse of the FPACK16 operation.


FPMERGE interleaves four 8-bit values from each of two 32-bit registers into a single 64-bit
destination register.


Programming Note — The result of FEXPAND or FPMERGE should not be used as a 32-bit
graphics instruction source operand in the next three instruction groups.


Exceptions


fp_disabled


A.46.1 FPACK16


FPACK16 takes four 16-bit fixed values from the 64-bit floating-point register specified by
rs2, scales, truncates, and clips them into four 8-bit unsigned integers, and stores the results
in the 32-bit destination register, f[rd]. FIGURE A-12 illustrates the FPACK16 operation.


[Figure: the four 16-bit fields of rs2 (bits 63:48, 47:32, 31:16, 15:0) are each shifted left by
GSR.scale (shown as x0100), and 8 bits starting at the implicit binary point are written to the
corresponding byte of rd.]


FIGURE A-12 FPACK16 Operation





Note — FPACK16 ignores the most significant bit of GSR.scale (GSR.scale<4>).


This operation is carried out as follows:


1. Left-shift the value from the 64-bit floating-point register specified by rs2 by the number
of bits specified in GSR.scale while maintaining clipping information.


2. Truncate and clip to an 8-bit unsigned integer starting at the bit immediately to the left of 
the implicit binary point (that is, between bits 7 and 6 for each 16-bit word). Truncation 
converts the scaled value into a signed integer (that is, round toward negative infinity). If 
the resulting value is negative (that is, its most significant bit is set), zero is returned as 
the clipped value. If the value is greater than 255, then 255 is delivered as the clipped 
value. Otherwise, the scaled value is returned as the result. 


3. Store the result in the corresponding byte in the 32-bit destination register, f[rd].
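The three steps above can be sketched in Python. The function name `fpack16` and passing GSR.scale as a plain integer argument are assumptions for illustration; Python's unbounded integers stand in for "maintaining clipping information" during the shift.

```python
def to_signed16(value: int) -> int:
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def fpack16(rs2: int, gsr_scale: int) -> int:
    """Pack four signed 16-bit fixed values into four unsigned bytes."""
    scale = gsr_scale & 0xF              # GSR.scale<4> is ignored by FPACK16
    result = 0
    for lane in range(4):
        value = to_signed16((rs2 >> (16 * lane)) & 0xFFFF)
        # Step 1 and 2: shift left, then drop the 7 fraction bits
        # (floor division, that is, round toward negative infinity).
        scaled = (value << scale) >> 7
        clipped = min(max(scaled, 0), 255)   # clip to 0..255
        result |= clipped << (8 * lane)      # step 3: store the byte
    return result


print(hex(fpack16(0x0100_1000_FFFF_0010, 3)))  # 0x10ff0001
```

In the example, 0x1000 overflows the 8-bit range and clips to 0xFF, while the negative value 0xFFFF clips to zero.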


A.46.2 FPACK32


FPACK32 takes two 32-bit fixed values from the second source operand (the 64-bit floating- 
point register specified by rs2) and scales, truncates, and clips them into two 8-bit unsigned 
integers. The two 8-bit integers are merged at the corresponding least significant byte 
positions with each 32-bit word in the 64-bit floating-point register specified by rs1, left- 
shifted by 8 bits. The 64-bit result is stored in the 64-bit floating-point register specified by 
rd. Thus, successive FPACK32 instructions can assemble two pixels by using three or four 
pairs of 32-bit fixed values. FIGURE A-13 illustrates the FPACK32 operation. 


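[Figure: each 32-bit word of rs2 is shifted left by GSR.scale (shown as 00110) and clipped to
8 bits at the implicit binary point; each 32-bit word of rs1 is shifted left by 8 bits, and the
clipped byte fills its least significant byte position in rd.]


FIGURE A-13 FPACK32 Operation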


This operation is carried out as follows: 


1. Left-shift each 32-bit value from the second source operand by the number of bits
specified in GSR.scale, while maintaining clipping information.


2. For each 32-bit value, truncate and clip to an 8-bit unsigned integer starting at the bit
immediately to the left of the implicit binary point (that is, between bits 23 and 22 for
each 32-bit word). Truncation converts the scaled value into a signed integer (that is,
round toward negative infinity). If the resulting value is negative (that is, MSB is set), then
zero is returned as the clipped value. If the value is greater than 255, then 255 is delivered
as the clipped value. Otherwise, the scaled value is returned as the result.


3. Left-shift each 32-bit value from the first source operand (the 64-bit floating-point register
specified by rs1) by 8 bits.


4. Merge the two clipped 8-bit unsigned values into the corresponding least significant byte
positions in the left-shifted value from the first source operand.


5. Store the result in the rd register.


A.46.3 FPACKFIX


FPACKFIX takes two 32-bit fixed values from the 64-bit floating-point register specified by
rs2, scales, truncates, and clips them into two 16-bit signed integers, and then stores the
result in the 32-bit destination register f[rd]. FIGURE A-14 illustrates the FPACKFIX
operation.
[Figure: each 32-bit word of rs2 (bits 63:32 and 31:0) is shifted left by GSR.scale, and 16
bits starting at the implicit binary point are written to rd<31:16> or rd<15:0>.]


FIGURE A-14 FPACKFIX Operation


This operation is carried out as follows: 


1. Left-shift each 32-bit value from the source operand (the 64-bit floating-point register
specified by rs2) by the number of bits specified in GSR.scale while maintaining
clipping information.


2. For each 32-bit value, truncate and clip to a 16-bit signed integer starting at the bit
immediately to the left of the implicit binary point (that is, between bits 16 and 15 for
each 32-bit word). Truncation converts the scaled value into a signed integer (that is,
round toward negative infinity). If the resulting value is less than -32768, then -32768 is
returned as the clipped value. If the value is greater than 32767, then 32767 is delivered as
the clipped value. Otherwise, the scaled value is returned as the result.


3. Store the result in the 32-bit destination register f[rd].
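A Python sketch of the FPACKFIX steps follows. The function name and the integer `gsr_scale` argument are hypothetical; the sketch assumes all five bits of GSR.scale participate, since only FPACK16 is documented as ignoring the most significant bit.

```python
def to_signed32(value: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement integer."""
    return value - (1 << 32) if value & 0x8000_0000 else value


def fpackfix(rs2: int, gsr_scale: int) -> int:
    """Pack two signed 32-bit fixed values into two signed 16-bit fields."""
    scale = gsr_scale & 0x1F             # assumed: full 5-bit scale field
    result = 0
    for lane in range(2):
        value = to_signed32((rs2 >> (32 * lane)) & 0xFFFF_FFFF)
        # Shift left, then drop the 16 fraction bits (round toward -infinity).
        scaled = (value << scale) >> 16
        clipped = min(max(scaled, -32768), 32767)
        result |= (clipped & 0xFFFF) << (16 * lane)
    return result


print(hex(fpackfix(0x00012345_FFFF0000, 0)))   # 0x1ffff
print(hex(fpackfix(0x00012345_FFFF0000, 16)))  # 0x7fff8000
```

With a scale of 16 both inputs exceed the 16-bit signed range, so the results saturate at 32767 and -32768.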


A.46.4 FEXPAND


FEXPAND takes four 8-bit unsigned integers from f[rs2], converts each integer to a 16-bit
fixed-point value, and stores the four resulting 16-bit values in a 64-bit floating-point register
specified by rd. FIGURE A-15 illustrates the operation.




















[Figure: each byte of rs2 (bits 31:24, 23:16, 15:8, 7:0) is placed in bits 11:4 of the
corresponding 16-bit field of rd, with bits 15:12 and 3:0 set to 0000.]


FIGURE A-15 FEXPAND Operation





This operation is carried out as follows: 
1. Left-shift each 8-bit value by four and zero-extend the results to a 16-bit fixed value. 


2. Store the result in the destination register. 
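The two FEXPAND steps amount to a per-byte shift, sketched here in Python; the function name is an illustrative stand-in for the instruction.

```python
def fexpand(rs2: int) -> int:
    """Expand four unsigned bytes into four 16-bit fixed-point values."""
    result = 0
    for lane in range(4):
        byte = (rs2 >> (8 * lane)) & 0xFF
        # Left-shift by four; the upper four bits of each field stay zero.
        result |= (byte << 4) << (16 * lane)
    return result


print(hex(fexpand(0xFF800100)))  # 0xff0080000100000
```

Each output field therefore ranges from 0x0000 to 0x0FF0.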


A.46.5 FPMERGE


FPMERGE interleaves four corresponding 8-bit unsigned values in f[rs1] and f[rs2] to
produce a 64-bit value in the 64-bit floating-point destination register specified by rd. This
instruction converts from packed to planar representation when it is applied twice in
succession; for example,

R1G1B1A1, R3G3B3A3 → R1R3G1G3B1B3A1A3 → R1R2R3R4G1G2G3G4.




















FPMERGE also converts from planar to packed when it is applied twice in succession; for
example, R1R2R3R4, B1B2B3B4 → R1B1R2B2R3B3R4B4 → R1G1B1A1R2G2B2A2.








FIGURE A-16 illustrates the operation. 
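The byte interleave can be sketched in Python as below. The function name is hypothetical; the byte order (the rs1 byte placed above the matching rs2 byte) follows the R1B1R2B2 pattern in the planar-to-packed example above.

```python
def fpmerge(rs1: int, rs2: int) -> int:
    """Interleave the four bytes of rs1 with the four bytes of rs2."""
    result = 0
    for lane in range(4):
        a = (rs1 >> (8 * lane)) & 0xFF
        b = (rs2 >> (8 * lane)) & 0xFF
        # The rs1 byte takes the more significant half of each 16-bit field.
        result |= ((a << 8) | b) << (16 * lane)
    return result


print(hex(fpmerge(0x11223344, 0xAABBCCDD)))  # 0x11aa22bb33cc44dd
```

Applying the function again to the two 32-bit halves of intermediate results reproduces the packed/planar conversions described in this section.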








[Figure: the bytes of rs1 and rs2 (bits 31:24, 23:16, 15:8, 7:0 of each) are interleaved into
the 64-bit register rd.]


FIGURE A-16 FPMERGE Operation























Back-to-back FPMERGEs cannot be done on adjacent cycles.


A.47 Population Count


Opcode  op3      Operation
POPC    10 1110  Population Count


Format (3)


10 | rd | op3 | 0 0000 | i=0 | —      | rs2
10 | rd | op3 | 0 0000 | i=1 | simm13
31 30 29 25 24 19 18 14 13 12 5 4 0


Assembly Language Syntax
popc reg_or_imm, reg_rd





Description 


POPC counts the number of one bits in r[rs2] if i = 0, or the number of one bits in
sign_ext(simm13) if i = 1, and stores the count in r[rd]. This instruction does not
modify the condition codes.
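Because this instruction is emulated in software on this processor, a reference model may be useful. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes the immediate is sign-extended to the full 64-bit register width before counting.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def sign_ext13(simm13: int) -> int:
    """Sign-extend a 13-bit immediate to a 64-bit pattern."""
    simm13 &= 0x1FFF
    if simm13 & 0x1000:
        simm13 -= 0x2000
    return simm13 & MASK64


def popc(operand: int, i: int) -> int:
    """operand is r[rs2] when i = 0, or the simm13 field when i = 1."""
    value = sign_ext13(operand) if i else operand & MASK64
    return bin(value).count("1")


print(popc(0xF0F0, 0))   # 8
print(popc(0x1FFF, 1))   # 64
```

An immediate of 0x1FFF is minus one, so its sign extension has all 64 bits set.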


Note — The UltraSPARC IIIi processor does not implement this instruction in hardware;
instead, it traps to software. The instruction is emulated in supervisor software.


Exceptions


illegal_instruction





A.48 Prefetch Data 














Opcode           op3      Operation
PREFETCH         10 1101  Prefetch Data
PREFETCHA^PASI   11 1101  Prefetch Data from Alternate Space











Implementation Note — The PREFETCH{A} instructions are supported in the 
UltraSPARC IIIi processor. 


Format (3) PREFETCH{A}


11 | fcn | op3 | rs1 | i=0 | PREFETCH: —  /  PREFETCHA: imm_asi | rs2
11 | fcn | op3 | rs1 | i=1 | simm13
31 30 29 25 24 19 18 14 13 12 5 4 0

















Assembly Language Syntax

prefetch   [address], prefetch_fcn
prefetcha  [regaddr] imm_asi, prefetch_fcn
prefetcha  [reg_plus_imm] %asi, prefetch_fcn


Description


Prefetching is used to help manage data memory cache(s). A prefetch to a non-prefetchable 
location has no effect. Non-cacheable and non-prefetchable locations are not the same. 


Variants of the prefetch instruction are used to prepare the memory system for different types 
of memory accesses. 


In non-privileged code, a prefetch instruction has no observable effect. Its execution is 
nonblocking and cannot cause an observable trap. In particular, a prefetch instruction shall 
not trap if it is applied to an illegal or nonexistent memory address. 





Programming Note — When software needs to prefetch 64 bytes beginning at an
arbitrary address, issue two prefetch instructions to cover all bytes:

prefetch [address], prefetch_fcn

prefetch [address + 63], prefetch_fcn
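The reason two prefetches suffice can be checked with a short Python sketch. It assumes a 64-byte prefetch granule (an assumption inferred from the 64-byte figure in the note, not stated explicitly there); the names `LINE`, `lines_touched`, and `lines_prefetched` are hypothetical.

```python
LINE = 64  # assumed prefetch granule in bytes


def lines_touched(address: int, length: int = 64) -> set:
    """Line indices covered by `length` bytes starting at `address`."""
    first = address // LINE
    last = (address + length - 1) // LINE
    return set(range(first, last + 1))


def lines_prefetched(address: int) -> set:
    """Line indices hit by prefetching [address] and [address + 63]."""
    return {address // LINE, (address + 63) // LINE}


# An unaligned 64-byte region spans two lines; both prefetches are needed.
print(sorted(lines_touched(100)))     # [1, 2]
print(sorted(lines_prefetched(100)))  # [1, 2]
```

When the address is already 64-byte aligned, both prefetches fall in the same line and the second one is redundant but harmless.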





PREFETCH A 


Prefetch instructions that do not load from an alternate address space access the primary
address space (ASI_PRIMARY{_LITTLE}). Prefetch instructions that do load from an
alternate address space contain the address space identifier (ASI) to be used for the load in
the imm_asi field if i = 0, or in the ASI register if i = 1. The access is privileged if bit 7
of the ASI is zero; otherwise, it is not privileged. The effective address for these instructions
is "r[rs1] + r[rs2]" if i = 0, or "r[rs1] + sign_ext(simm13)" if i = 1.


Exceptions


illegal_instruction


A.48.1 Prefetch Instruction Variants


PREFETCH{A} instructions with fcn = 0-3 are implemented.


Each prefetch variant reflects an intent on the part of the compiler or programmer. This is 
different from other instructions in SPARC-V9 (except BPN), all of which specify specific 
actions. 


The prefetch instruction variants are intended to provide scalability for future improvements 
in both hardware and compilers. 


The prefetch variant is selected by the fcn field of the instruction. In accordance with
SPARC-V9, fcn values 4-15 cause an illegal_instruction exception.


A prefetch with fcn = 16 invalidates the P-cache line corresponding to the effective address
of the prefetch. Use this characteristic to prefetch non-cacheable data after data are loaded
into registers from the P-cache. A prefetch invalidate is issued to remove the data from the
P-cache so it will not be found by a later reference. Prefetches with fcn = 20, 21, 22, and 23
are new.


TABLE A-13 lists the types of software prefetch instructions. Note that the table gives fcn
values in hexadecimal, unlike the decimal values used in the explanation above.


TABLE A-13 Types of Software Prefetch Instructions


fcn Value                                                        Instruction Strength   Request Exclusive
(hex)      Instruction Type        Prefetch into:                UltraSPARC IIIi        Ownership

00         Prefetch read many      P-cache and L2-cache          weak                   No
01         Prefetch read once      P-cache only                  weak                   No
02         Prefetch write many     L2-cache only                 weak                   Yes
03         Prefetch write once¹    L2-cache only                 weak                   No
04         reserved                Undefined
05-0F      reserved                Undefined
10         Prefetch invalidate     Invalidates a P-cache line,   N/A
                                   no data is prefetched.
11-13      reserved                Undefined
14         Same as fcn = 00                                      weak²                  No
15         Same as fcn = 01                                      weak²                  No
16         Same as fcn = 02                                      weak²                  Yes
17         Same as fcn = 03                                      weak²                  No
18-1F      reserved                Undefined


1. Although the name is "prefetch write once," the actual use is prefetch to L2-cache for a future read.


2. These weak instructions may be implemented as strong in future implementations.


A.48.2 New Error Handling of PREFETCH,2 and Other Prefetches


Because PREFETCH,2 requests cache line ownership (RTO/R_RTO), an error that occurs while
processing it is handled differently from errors on other prefetch requests, which use
RTS/R_RTS, as described in TABLE A-14.


TABLE A-14 Error Handling of Prefetch Requests


Prefetch Type | L2-cache Hit/Miss | Error Type | L2-cache Action | P-cache Action | Error Logging | Trap

PREFETCH,2 (RTO/R_RTO):
Hit | Tag, hardware-corrected | No state change | None | THCE | Disrupting
Miss | Tag, hardware-corrected | Install data, state change to M | None | THCE | Disrupting
"Hit" (tag error) | Tag, uncorrectable | No data install, no state change | None | TUE | Fatal Error
Hit | Data, hardware-corrected | No state change | None | EDC | Disrupting
Hit | Data, uncorrectable | No state change | None | EDU | Disrupting
Miss | Data, hardware-corrected | Install data, state change to M | None | CE | Disrupting
Miss | Data, uncorrectable | Install uncorrected data, state change to M | None | DUE | Disrupting
Miss | Mtag, hardware-corrected | Install data, state change to M | None | EMC | Disrupting
Miss | Mtag, uncorrectable | Install data if L2-cache state is M or Os | None | EMU | Fatal Error

PREFETCH,0, PREFETCH,1, PREFETCH,3, and hardware prefetch (RTS/R_RTS):
Hit | Tag, hardware-corrected | No state change | Install data (except PREFETCH,3) | THCE | Disrupting
Miss | Tag, hardware-corrected | Install data, state change to S or E | Install data (except PREFETCH,3) | THCE | Disrupting
"Hit" (tag error) | Tag, uncorrectable | No data install, no state change | Cancel install | TUE | Fatal Error
Hit | Data, hardware-corrected | No state change | Install data (except PREFETCH,3) | EDC | Disrupting
Hit | Data, uncorrectable | No state change | Cancel install | EDU | Disrupting
Miss | Data, hardware-corrected | Install data, state change to S or E | Install data (except PREFETCH,3) | CE | Disrupting
Miss | Data, uncorrectable | If RTS, cancel install, no state change. If R_RTS, install uncorrected data, state change to Os. | Cancel install | DUE | Disrupting
Miss | Mtag, hardware-corrected | Install data, state change to S or E | None | EMC | Disrupting
Miss | Mtag, uncorrectable | Install data if L2-cache state is M or Os | None | EMU | Fatal Error




















A.48.2.1 New Column in Coherence Table


A new column has been added to the UltraSPARC IIIi Coherence Table to describe the
processor action on write prefetch RTO. Basically, the behavior of coherence state change is
the following:


On L2-cache hit: same as Load request (no state change)


On L2-cache miss: same as Store request (send RTO/R_RTO to get M state)


A.49 Read Privileged Register


Opcode   op3      Operation
RDPR^P   10 1010  Read Privileged Register


Format (3)


10 | rd | op3 | rs1 | —
31 30 29 25 24 19 18 14 13 0


rs1     Privileged Register
0       TPC
1       TNPC
2       TSTATE
3       TT
4       TICK
5       TBA
6       PSTATE
7       TL
8       PIL
9       CWP
10      CANSAVE
11      CANRESTORE
12      CLEANWIN
13      OTHERWIN
14      WSTATE
15      FQ
16-30   Reserved
31      VER


Assembly Language Syntax

rdpr %tpc, reg_rd
rdpr %tnpc, reg_rd
rdpr %tstate, reg_rd
rdpr %tt, reg_rd
rdpr %tick, reg_rd
rdpr %tba, reg_rd
rdpr %pstate, reg_rd
rdpr %tl, reg_rd
rdpr %pil, reg_rd
rdpr %cwp, reg_rd
rdpr %cansave, reg_rd
rdpr %canrestore, reg_rd
rdpr %cleanwin, reg_rd
rdpr %otherwin, reg_rd
rdpr %wstate, reg_rd
rdpr %fq, reg_rd
rdpr %ver, reg_rd
Description 


The rs1 field in the instruction determines the privileged register that is read. There are
MAXTL copies of the TPC, TNPC, TT, and TSTATE registers. A read from one of these
registers returns the value in the register indexed by the current value in the trap level
register (TL). A read of TPC, TNPC, TT, or TSTATE when the trap level is zero (TL = 0)
causes an illegal_instruction exception.


RDPR instructions with rs1 in the range 16-30 are reserved; executing an RDPR instruction
with rs1 in that range causes an illegal_instruction exception.


Programming Note — On this implementation with precise floating-point traps, the
address of a trapping instruction will be in the TPC[TL] register when the trap code begins
execution.


Exceptions


privileged_opcode
illegal_instruction ((rs1 = 16-30) or ((rs1 ≤ 3) and (TL = 0)))


A.50 Read State Register


Opcode                   op3      rs1    Operation
RDY^D                    10 1000  0      Read Y Register; deprecated (see Section A.70.9, "Read Y Register")
—                        10 1000  1      Reserved, do not access; attempt to access causes an illegal_instruction exception.
RDCCR                    10 1000  2      Read Condition Codes Register
RDASI                    10 1000  3      Read ASI Register
RDTICK^PNPT              10 1000  4      Read TICK Register
RDPC                     10 1000  5      Read Program Counter
RDFPRS                   10 1000  6      Read Floating-Point Registers Status Register
—                        10 1000  7-14   Reserved, do not access; attempt to access causes an illegal_instruction exception.
See section description  10 1000  15     STBAR, MEMBAR, or Reserved; see section description.
RDASR                    10 1000  16-31  Read non-SPARC-V9 ASRs
  RDPCR^PPCR                      16     Read Performance Control Registers (PCR)
  RDPIC^PPIC                      17     Read Performance Instrumentation Counters (PIC)
  RDDCR^P                         18     Read Dispatch Control Register (DCR)
  RDGSR                           19     Read Graphic Status Register (GSR)
  —                               20-21  Reserved, do not access; attempt to access causes an illegal_instruction exception.
  RDSOFTINT^P                     22     Read per-processor Soft Interrupt Register
  RDTICK_CMPR^P                   23     Read Tick Compare Register
  RDSTICK^PNPT                    24     Read System TICK Register
  RDSTICK_CMPR^P                  25     Read System TICK Compare Register
  —                               26-31  Reserved, do not access; attempt to access causes an illegal_instruction exception.


Format (3)


10 | rd | op3 | rs1 | i=0 | —
31 30 29 25 24 19 18 14 13 12 0
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Assembly Language Syntax 

rd %ccr, reg_rd
rd %asi, reg_rd
rd %tick, reg_rd
rd %pc, reg_rd
rd %fprs, reg_rd
rd %pcr, reg_rd
rd %pic, reg_rd
rd %dcr, reg_rd
rd %gsr, reg_rd
rd %softint, reg_rd
rd %tick_cmpr, reg_rd
rd %sys_tick, reg_rd
rd %sys_tick_cmpr, reg_rd

Description


These instructions read the state register specified by rs1 into r[rd].


Values 7-14 of rs1 are reserved for future versions of the architecture. A Read State
Register instruction with rs1 = 15, rd = 0, and i = 0 is defined to be a (deprecated) STBAR
instruction (see Section A.70.10, “Store Barrier”). An RDASR instruction with rs1 = 15,
rd = 0, and i = 1 is defined to be a MEMBAR instruction. RDASR with rs1 = 15 and rd ≠ 0
is reserved for future versions of the architecture; it causes an illegal_instruction exception.





For RDPC, the processor writes the full 64-bit program counter value to the destination 
register of a CALL, JMPL, or RDPC instruction. When PSTATE.AM = 1 and a trap occurs, 
the processor writes the full 64-bit program counter value to TPC[TL]. 





RDFPRS waits for all pending FPops and loads of floating-point registers to complete before 
reading the FPRS register. 





RDGSR causes an fp_disabled exception if PSTATE.PEF = 0 or FPRS.FEF = 0.











RDTICK causes a privileged_action exception if PSTATE.PRIV = 0 and TICK.NPT = 1.
RDSTICK causes a privileged_action exception if PSTATE.PRIV = 0 and STICK.NPT = 1.

















RDPIC causes a privileged_action exception if PSTATE.PRIV = 0 and PCR.PRIV = 1.

RDPCR causes a privileged_opcode exception due to access privilege violation.




Implementation Note — Ancillary state registers include, for example, timer, counter, 
diagnostic, self-test, and trap-control registers. 








Compatibility Note — The SPARC-V8 RDPSR, RDWIM, and RDTBR instructions do not 
exist in SPARC-V9 since the PSR, WIM, and TBR registers do not exist in SPARC-V9. 





Exceptions 


privileged_opcode (RDDCR, RDSOFTINT, RDTICK_CMPR, RDSTICK, RDSTICK_CMPR,
    and RDPCR)
illegal_instruction (RDASR with rs1 = 1 or 7-14;
    RDASR with rs1 = 15 and rd ≠ 0;
    RDASR with rs1 = 20-21, 26-31)
privileged_action (RDTICK with PSTATE.PRIV = 0 and TICK.NPT = 1;
    RDPIC with PSTATE.PRIV = 0 and PCR.PRIV = 1;
    RDSTICK with PSTATE.PRIV = 0 and STICK.NPT = 1)
fp_disabled (RDGSR with PSTATE.PEF = 0 or FPRS.FEF = 0)
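
The rs1 = 15 special case and the reserved rs1 ranges described in this section can be summarized in code. This is a minimal C sketch under stated assumptions: the names `rs15_kind`, `decode_rs1_15`, and `rdasr_rs1_is_reserved` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { RS15_STBAR, RS15_MEMBAR, RS15_RESERVED } rs15_kind;

/* Classifies a Read State Register encoding with rs1 = 15 from its
 * rd and i fields, following the description in this section. */
rs15_kind decode_rs1_15(unsigned rd, unsigned i)
{
    assert(rd < 32 && i < 2);
    if (rd != 0)
        return RS15_RESERVED;      /* causes illegal_instruction */
    return i == 0 ? RS15_STBAR : RS15_MEMBAR;
}

/* True for the rs1 values listed under illegal_instruction
 * (rs1 = 15 is handled separately by decode_rs1_15). */
bool rdasr_rs1_is_reserved(unsigned rs1)
{
    assert(rs1 < 32);
    return rs1 == 1 || (rs1 >= 7 && rs1 <= 14) ||
           rs1 == 20 || rs1 == 21 || rs1 >= 26;
}
```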


























A.51 RETURN

Opcode op3 Operation
RETURN 11 1001 RETURN

Format (3)
10 — op3 rs1 i=0 — rs2
10 — op3 rs1 i=1 simm13
31 30 29 25 24 19 18 14 13 12 5 4 0
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Assembly Language Syntax 


return address 





Description 





The RETURN instruction causes a delayed transfer of control to the target address and has the
window semantics of a RESTORE instruction; that is, it restores the register window prior to
the last SAVE instruction. The target address is “r[rs1] + r[rs2]” if i = 0, or
“r[rs1] + sign_ext(simm13)” if i = 1. Registers r[rs1] and r[rs2] come from
the old window.




















The RETURN instruction may cause an exception. It may cause a window_fill exception as
part of its RESTORE semantics, or it may cause a mem_address_not_aligned exception if
either of the two low-order bits of the target address is nonzero.
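
The target-address computation and alignment test described above can be sketched in C. The helper names (`sign_ext13`, `return_target`, `return_target_misaligned`) are hypothetical; the sketch models only the arithmetic, not the window restore or trap delivery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sign-extends a 13-bit immediate field to 64 bits. */
int64_t sign_ext13(uint32_t simm13)
{
    assert(simm13 <= 0x1fff);      /* simm13 is a 13-bit field */
    return (int64_t)(simm13 ^ 0x1000) - 0x1000;
}

/* Target address: r[rs1] + r[rs2] if i = 0,
 * r[rs1] + sign_ext(simm13) if i = 1 (64-bit wraparound arithmetic). */
uint64_t return_target(uint64_t rs1_val, uint64_t rs2_val,
                       uint32_t simm13, int i)
{
    return i == 0 ? rs1_val + rs2_val
                  : rs1_val + (uint64_t)sign_ext13(simm13);
}

/* mem_address_not_aligned applies when either low-order bit is nonzero. */
bool return_target_misaligned(uint64_t target)
{
    return (target & 3u) != 0;
}
```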














Programming Note — To re-execute the trapped instruction when returning from a user 
trap handler, use the RETURN instruction in the delay slot of a JMPL instruction, for
example:

jmpl   %l6, %g0       ! Trapped PC supplied to user trap handler
return %l7            ! Trapped nPC supplied to user trap handler








Programming Note — A routine that uses a register window may be structured either as: 


save    %sp, -framesize, %sp
ret                   ! Same as jmpl %i7 + 8, %g0
restore               ! Something useful like "restore
                      ! %o2,%l2,%o0"

Or,

save    %sp, -framesize, %sp
return  %i7 + 8
nop                   ! Could do some useful work in the caller's
                      ! window, for example, "or %o1,%o2,%o0"
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Exceptions 


mem_address_not_aligned 
fill_n_normal (n = 0-7) 
fill_n_other (n = 0-7)





A.52 SAVE and RESTORE
































Opcode op3 Operation
SAVE 11 1100 Save Caller's Window
RESTORE 11 1101 Restore Caller's Window

Format (3)
10 rd op3 rs1 i=0 — rs2
10 rd op3 rs1 i=1 simm13
31 30 29 25 24 19 18 14 13 12 5 4 0





Assembly Language Syntax 





save reg_rs1, reg_or_imm, reg_rd
restore reg_rs1, reg_or_imm, reg_rd





Description (Effect on Non-Privileged State) 


The SAVE instruction provides the routine executing it with a new register window. The out
registers from the old window become the in registers of the new window. The contents of
the out and the local registers in the new window are zero or contain values from the
executing process; that is, the process sees a clean window.

The RESTORE instruction restores the register window saved by the last SAVE instruction
executed by the current process. The in registers of the old window become the out registers
of the new window. The in and local registers in the new window contain the previous values.

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Furthermore, if and only if a spill or fill trap is not generated, SAVE and RESTORE behave 
like normal ADD instructions, except that the source operands r[rs1] or r[rs2] are read 
from the old window (that is, the window addressed by the original CWP) and the sum is 
written into r[rd] of the new window (that is, the window addressed by the new CWP). 








Note — CWP arithmetic is performed modulo the number of windows, NWINDOWS.























Programming Note — Typically, if a SAVE (RESTORE) instruction traps, the spill (fill) 
trap handler returns to the trapped instruction to reexecute it. So, although the ADD operation 
is not performed the first time (when the instruction traps), it is performed the second time 
the instruction executes. The same applies to changing the CWP. 


The SAVE instruction can be used to atomically allocate a new window in the register file 
and a new software stack frame in memory. 














There is a performance trade-off to consider between using SAVE/RESTORE and saving and 
restoring selected registers explicitly. 





Description (Effect on Privileged State) 


If the SAVE instruction does not trap, it increments the CWP (mod NWINDOWS) to provide a 
new register window and updates the state of the register windows by decrementing 
CANSAVE and incrementing CANRESTORE. 




















If the new register window is occupied (that is, CANSAVE = 0), a spill trap is generated. The
trap vector for the spill trap is based on the value of OTHERWIN and WSTATE. The spill trap
handler is invoked with the CWP set to point to the window to be spilled (that is, old
CWP + 2).











If CANSAVE ≠ 0, the SAVE instruction checks whether the new window needs to be cleaned.
It causes a clean_window trap if the number of unused clean windows is zero, that is,
(CLEANWIN - CANRESTORE) = 0. The clean_window trap handler is invoked with the CWP
set to point to the window to be cleaned (that is, old CWP + 1).


























If the RESTORE instruction does not trap, it decrements the CWP (mod NWINDOWS) to 
restore the register window that was in use prior to the last SAVE instruction executed by the 
current process. It also updates the state of the register windows by decrementing 
CANRESTORE and incrementing CANSAVE. 






































If the register window to be restored has been spilled (CANRESTORE = 0), then a fill trap is
generated. The trap vector for the fill trap is based on the values of OTHERWIN and
WSTATE. The fill trap handler is invoked with CWP set to point to the window to be filled,
that is, old CWP - 1.
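
The privileged-state effects of SAVE and RESTORE described above can be modeled as a small state machine. The C sketch below is an illustration under stated assumptions: `win_state`, `win_event`, `model_save`, and `model_restore` are hypothetical names, and the model leaves the state unchanged when a trap condition is detected (it does not model the CWP adjustment made for the trap handler or the trap vector selection).

```c
#include <assert.h>

typedef struct {
    unsigned cwp, cansave, canrestore, cleanwin, otherwin, nwindows;
} win_state;

typedef enum { WIN_OK, WIN_SPILL, WIN_CLEAN_WINDOW, WIN_FILL } win_event;

/* SAVE: spill trap if CANSAVE = 0, clean_window trap if
 * (CLEANWIN - CANRESTORE) = 0, otherwise advance the window. */
win_event model_save(win_state *w)
{
    assert(w->nwindows > 0);
    if (w->cansave == 0)
        return WIN_SPILL;
    if (w->cleanwin == w->canrestore)
        return WIN_CLEAN_WINDOW;
    w->cwp = (w->cwp + 1) % w->nwindows;   /* CWP arithmetic is mod NWINDOWS */
    w->cansave--;
    w->canrestore++;
    return WIN_OK;
}

/* RESTORE: fill trap if CANRESTORE = 0, otherwise step back one window. */
win_event model_restore(win_state *w)
{
    assert(w->nwindows > 0);
    if (w->canrestore == 0)
        return WIN_FILL;
    w->cwp = (w->cwp + w->nwindows - 1) % w->nwindows;
    w->canrestore--;
    w->cansave++;
    return WIN_OK;
}
```

In this model a SAVE followed by a RESTORE returns CWP, CANSAVE, and CANRESTORE to their starting values, which mirrors the paired increment and decrement described in the text.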
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Programming Note — The vectoring of spill and fill traps can be controlled by setting the 
value of the OTHERWIN and WSTATE registers appropriately. 























The spill (fill) handler normally will end with a SAVED (RESTORED) instruction followed by 
a RETRY instruction. 





Exceptions 





clean_window (SAVE only)
fill_n_normal (RESTORE only, n = 0-7)
fill_n_other (RESTORE only, n = 0-7)
spill_n_normal (SAVE only, n = 0-7)
spill_n_other (SAVE only, n = 0-7)








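A.53 SAVED and RESTORED

Opcode op3 fcn Operation
SAVED^P 11 0001 0 Window has been saved
RESTORED^P 11 0001 1 Window has been restored
— 11 0001 2-31 Reserved

Format (3)
10 fcn op3 —
31 30 29 25 24 19 18 0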





Assembly Language Syntax 





saved 








restored 








Description 





SAVED and RESTORED adjust the state of the register-windows control registers. 
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SAVED increments CANSAVE. If OTHERWIN = 0, SAVED decrements CANRESTORE. 
If OTHERWIN ≠ 0, it decrements OTHERWIN.





















































RESTORED increments CANRESTORE. If CLEANWIN < (NWINDOWS - 1), then RESTORED
increments CLEANWIN. If OTHERWIN = 0, it decrements CANSAVE. If OTHERWIN ≠ 0, it
decrements OTHERWIN.
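
The register adjustments made by SAVED and RESTORED can be expressed directly in code. This C sketch is illustrative only: the type `win_regs` and the functions `model_saved` and `model_restored` are hypothetical names. The assertions guard the model against counter underflow; they are an assumption of the sketch, since the hardware signals no exception for an inconsistent window state.

```c
#include <assert.h>

typedef struct {
    unsigned cansave, canrestore, cleanwin, otherwin, nwindows;
} win_regs;

/* SAVED: increment CANSAVE; decrement CANRESTORE if OTHERWIN = 0,
 * otherwise decrement OTHERWIN. */
void model_saved(win_regs *w)
{
    w->cansave++;
    if (w->otherwin == 0) {
        assert(w->canrestore > 0);   /* model-only underflow guard */
        w->canrestore--;
    } else {
        w->otherwin--;
    }
}

/* RESTORED: increment CANRESTORE; increment CLEANWIN if it is below
 * NWINDOWS - 1; decrement CANSAVE if OTHERWIN = 0, else OTHERWIN. */
void model_restored(win_regs *w)
{
    w->canrestore++;
    if (w->cleanwin < w->nwindows - 1)
        w->cleanwin++;
    if (w->otherwin == 0) {
        assert(w->cansave > 0);      /* model-only underflow guard */
        w->cansave--;
    } else {
        w->otherwin--;
    }
}
```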






























































Programming Note — The spill (fill) handlers use the SAVED (RESTORED) instruction to 
indicate that a window has been spilled (filled) successfully. 

















Normal privileged software would probably not do a SAVED or RESTORED from trap level
zero (TL = 0). However, it is not illegal to do so and doing so does not cause a trap.




















Executing a SAVED (RESTORED) instruction outside of a window spill (fill) trap handler is 
likely to create an inconsistent window state. Hardware will not signal an exception, 
however, since maintaining a consistent window state is the responsibility of privileged 
software. 


Exceptions 


privileged_opcode
illegal_instruction (fcn = 2-31)





A.54 Set Interval Arithmetic Mode (VIS II) 


Opcode opf Operation
SIAM 0 1000 0001 Set the interval arithmetic mode fields in the GSR

Format (3)
10 — op3 — opf — mode
31 30 29 25 24 19 18 14 13 5 4 3 2 0



Assembly Language Syntax 


siam mode 





Description 
The SIAM instruction sets the GSR.IM and GSR.IRND fields as follows:
GSR.IM = mode<2> 
GSR.IRND = mode<1:0> 
Note — SIAM is a groupable, break-after instruction. It enables the interval rounding mode
to be changed every cycle without flushing the pipeline. FPops in the same instruction group
as an SIAM instruction use the previous rounding mode.
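
The field assignment GSR.IM = mode<2>, GSR.IRND = mode<1:0> amounts to splitting the 3-bit mode operand. A minimal C sketch follows; the struct `gsr_interval_mode` and the function `siam_decode` are hypothetical names used only for illustration.

```c
#include <assert.h>

typedef struct {
    unsigned im;     /* GSR.IM   = mode<2>   */
    unsigned irnd;   /* GSR.IRND = mode<1:0> */
} gsr_interval_mode;

/* Splits the 3-bit SIAM mode operand into the two GSR fields. */
gsr_interval_mode siam_decode(unsigned mode)
{
    assert(mode < 8);                /* mode is a 3-bit field */
    gsr_interval_mode f;
    f.im = (mode >> 2) & 1u;
    f.irnd = mode & 3u;
    return f;
}
```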


Exceptions 


fp_disabled

A.55 SETHI

Opcode op2 Operation
SETHI 100 Set High 22 Bits of Low Word

Format (2)
00 rd op2 imm22
31 30 29 25 24 22 21 0





Assembly Language Syntax 





sethi const22, reg_rd
sethi %hi(value), reg_rd

Description





SETHI zeroes the least significant 10 bits and the most significant 32 bits of r[rd] and 
replaces bits 31 through 10 of r[rd] with the value from its imm22 field. 











SETHI does not affect the condition codes. 





SETHI instructions with rd = 0 have special uses:

rd = 0 and imm22 = 0: has no architectural effect and is defined to be a NOP instruction.

rd = 0 and imm22 ≠ 0: is used to trigger hardware performance counters. See Chapter 11,
“Performance Instrumentation,” for details.





Programming Note — The most common form of 64-bit constant generation is creating 
stack offsets whose magnitude is less than 2^32. The code below can be used to create the
constant 0000 0000 ABCD 1234₁₆:

sethi %hi(0xabcd1234), %o0
or    %o0, 0x234, %o0

The following code shows how to create a negative constant. Note: The immediate field of
the xor instruction is sign extended and can be used to get ones in all of the upper 32 bits.
For example, to set the negative constant FFFF FFFF ABCD 1234₁₆:

sethi %hi(0x5432edcb), %o0   ! note 0x5432EDCB, not 0xABCD1234
xor   %o0, 0x1e34, %o0       ! part of imm. overlaps upper bits





Exceptions 


None 
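
The two constant-generation sequences in the SETHI programming note can be checked with a small model of the instruction semantics. The C sketch below is illustrative: `sethi`, `hi`, and `sign_ext13` are hypothetical helper names that mirror SETHI, the %hi() assembler operator, and the sign extension applied to a 13-bit immediate.

```c
#include <assert.h>
#include <stdint.h>

/* SETHI: places imm22 in bits 31:10 and zeroes bits 9:0 and 63:32. */
uint64_t sethi(uint32_t imm22)
{
    assert(imm22 <= 0x3fffff);       /* imm22 is a 22-bit field */
    return (uint64_t)imm22 << 10;
}

/* %hi(value): the high-order 22 bits of a 32-bit value. */
uint32_t hi(uint32_t value)
{
    return value >> 10;
}

/* Sign-extends a 13-bit immediate to 64 bits, as OR and XOR do. */
uint64_t sign_ext13(uint32_t simm13)
{
    assert(simm13 <= 0x1fff);
    return (uint64_t)((int64_t)(simm13 ^ 0x1000) - 0x1000);
}
```

With these helpers, `sethi(hi(0xabcd1234)) | sign_ext13(0x234)` yields 0000 0000 ABCD 1234₁₆, and `sethi(hi(0x5432edcb)) ^ sign_ext13(0x1e34)` yields FFFF FFFF ABCD 1234₁₆, because the sign-extended XOR immediate inverts bits 63:10 of the SETHI result.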





A.56 Shift




















Opcode op3 x Operation 
SLL 10 0101 0 Shift Left Logical — 32 bits 
SRL 10 0110 0 Shift Right Logical — 32 bits 
SRA 10 0111 0 Shift Right Arithmetic — 32 bits 
SLLX 10 0101 1 Shift Left Logical — 64 bits 
SRLX 10 0110 1 Shift Right Logical — 64 bits 
SRAX 10 0111 1 Shift Right Arithmetic — 64 bits 
Format (3) 
10 rd op3 rs1 i=0 x — rs2
10 rd op3 rs1 i=1 x=0 — shcnt32
10 rd op3 rs1 i=1 x=1 — shcnt64
31 30 29 25 24 19 18 14 13 12 6 5 4 0
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Assembly Language Syntax 








sll reg_rs1, reg_or_shcnt, reg_rd
srl reg_rs1, reg_or_shcnt, reg_rd
sra reg_rs1, reg_or_shcnt, reg_rd
sllx reg_rs1, reg_or_shcnt, reg_rd
srlx reg_rs1, reg_or_shcnt, reg_rd
srax reg_rs1, reg_or_shcnt, reg_rd

Description


When i = 0 and x = 0, the shift count is the least significant five bits of r[rs2]. When
i = 0 and x = 1, the shift count is the least significant six bits of r[rs2]. When i = 1 and
x = 0, the shift count is the immediate value specified in bits 0 through 4 of the instruction.
When i = 1 and x = 1, the shift count is the immediate value specified in bits 0 through 5 of
the instruction.


TABLE A-15 shows the shift count encodings for all values of i and x. 


TABLE A-15 Shift Count Encodings

i  x  Shift Count
0  0  bits 4-0 of r[rs2]
0  1  bits 5-0 of r[rs2]
1  0  bits 4-0 of instruction
1  1  bits 5-0 of instruction





SLL and SLLX shift all 64 bits of the value in r[rs1] left by the number of bits specified
by the shift count, replacing the vacated positions with zeroes, and write the shifted result to
r[rd].


SRL shifts the low 32 bits of the value in r[rs1] right by the number of bits specified by 
the shift count. Zeroes are shifted into bit 31. The upper 32 bits are set to zero, and the result 
is written to r[rd]. 


SRLX shifts all 64 bits of the value in r[rs1] right by the number of bits specified by the 
shift count. Zeroes are shifted into the vacated high-order bit positions, and the shifted result 
is written to r[rd]. 


SRA shifts the low 32 bits of the value in r[rs1] right by the number of bits specified by
the shift count and replaces the vacated positions with bit 31 of r[rs1]. The high-order
32 bits of the result are all set with bit 31 of r[rs1], and the result is written to r[rd].

SRAX shifts all 64 bits of the value in r[rs1] right by the number of bits specified by the
shift count and replaces the vacated positions with bit 63 of r[rs1]. The shifted result is
written to r[rd].


No shift occurs when the shift count is zero, but the high-order bits are affected by the 32-bit 
shifts as noted above. 


These instructions do not modify the condition codes. 





Programming Note — “Arithmetic left shift by 1 (and calculate overflow)" can be 
effected with the ADDcc instruction. 


The instruction “sra rs1,0,rd” can be used to convert a 32-bit value to 64 bits, with
sign extension into the upper word; “srl rs1,0,rd” can be used to clear the upper
32 bits of r[rd].
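
The shift semantics above, including the effect of a zero shift count on the upper 32 bits, can be modeled in C. This is a sketch under stated assumptions: the functions `sll`, `srl`, and `sra` and the helper `shift_count` are hypothetical names, and the `x` argument selects the 32-bit (x = 0) or 64-bit (x = 1) form. Sign fill is done with explicit masks so the model does not depend on how a C compiler shifts negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Shift count: low five bits when x = 0, low six bits when x = 1. */
static unsigned shift_count(uint64_t count_src, int x)
{
    assert(x == 0 || x == 1);
    return (unsigned)(count_src & (x ? 63u : 31u));
}

/* SLL / SLLX: all 64 bits are shifted left; zeroes fill from the right. */
uint64_t sll(uint64_t rs1, uint64_t count_src, int x)
{
    return rs1 << shift_count(count_src, x);
}

/* SRL shifts the low 32 bits and zeroes the upper 32; SRLX shifts all 64. */
uint64_t srl(uint64_t rs1, uint64_t count_src, int x)
{
    unsigned n = shift_count(count_src, x);
    return x ? rs1 >> n : (rs1 & 0xffffffffull) >> n;
}

/* SRA fills with bit 31 (and sets the upper 32 bits to bit 31);
 * SRAX fills with bit 63. */
uint64_t sra(uint64_t rs1, uint64_t count_src, int x)
{
    unsigned n = shift_count(count_src, x);
    uint64_t v = rs1;
    if (!x) {
        v &= 0xffffffffull;
        if (v & 0x80000000ull)
            v |= 0xffffffff00000000ull;   /* replicate bit 31 upward */
    }
    uint64_t r = v >> n;
    if (v >> 63)
        r |= ~(UINT64_MAX >> n);          /* fill vacated bits with ones */
    return r;
}
```

For instance, `sra(v, 0, 0)` sign-extends the low word of `v`, and `srl(v, 0, 0)` clears its upper 32 bits, matching the two idioms in the programming note.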


Exceptions 


None 





A.57 Short Floating-Point Load and Store (VIS I)














Opcode        imm_asi      ASI Value  Operation
LDDFA, STDFA  ASI_FL8_P    D0₁₆       8-bit load/store from/to primary address space
LDDFA, STDFA  ASI_FL8_S    D1₁₆       8-bit load/store from/to secondary address space
LDDFA, STDFA  ASI_FL8_PL   D8₁₆       8-bit load/store from/to primary address space, little-endian
LDDFA, STDFA  ASI_FL8_SL   D9₁₆       8-bit load/store from/to secondary address space, little-endian
LDDFA, STDFA  ASI_FL16_P   D2₁₆       16-bit load/store from/to primary address space
LDDFA, STDFA  ASI_FL16_S   D3₁₆       16-bit load/store from/to secondary address space
LDDFA, STDFA  ASI_FL16_PL  DA₁₆       16-bit load/store from/to primary address space, little-endian
LDDFA, STDFA  ASI_FL16_SL  DB₁₆       16-bit load/store from/to secondary address space, little-endian

Format (3) LDDFA
11 rd op3 rs1 i=0 imm_asi rs2
11 rd op3 rs1 i=1 simm13
31 30 29 25 24 19 18 14 13 5 4 0

Format (3) STDFA
11 rd op3 rs1 i=0 imm_asi rs2
11 rd op3 rs1 i=1 simm13
31 30 29 25 24 19 18 14 13 5 4 0


Assembly Language Syntax 





ldda [reg_addr] imm_asi, freg_rd
ldda [reg_plus_imm] %asi, freg_rd
stda freg_rd, [reg_addr] imm_asi
stda freg_rd, [reg_plus_imm] %asi

Description


Short floating-point load and store instructions are selected by means of one of the short 
ASIs with the LDDFA and STDFA instructions. 


These ASIs allow 8- and 16-bit loads or stores to be performed to/from the floating-point
registers. Eight-bit loads can be performed to arbitrary byte addresses. For 16-bit loads, the
least significant bit of the address must be zero or a mem_address_not_aligned trap is taken.
Short loads are zero-extended to the full floating-point register. Short stores access the
low-order 8 or 16 bits of the register.


Little-endian ASIs transfer data in little-endian format in memory; otherwise, memory is 
assumed to be big-endian. Short loads and stores are typically used with the FALIGNDATA 
instruction (see Section A.2, "Alignment Instructions (VIS I)") to assemble or store 64 bits 
on noncontiguous components. 
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Exceptions 


fp_disabled
PA_watchpoint
VA_watchpoint
mem_address_not_aligned (odd memory address for a 16-bit load or store)
data_access_exception
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection








A.58 SHUTDOWN (VIS I) 





Opcode opf Operation
SHUTDOWN^P 0 1000 0000 Shut down to enter power-down mode





Format (3) 


31 30 29 25 24 19 18 14 13 5 4 0 





Assembly Language Syntax 





shutdown 


Description 
SHUTDOWN is a privileged instruction. 


The SHUTDOWN instruction executes as a NOP. An external system signal is used to enter 
and leave Low Power mode. 


Because SHUTDOWN is a privileged instruction, an attempt to execute it while in non- 
privileged mode causes a privileged_opcode trap. 
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Exceptions 


privileged_opcode 





A.59 Software-Initiated Reset

Opcode op3 rd Operation
SIR 11 0000 15 Software-Initiated Reset

Format (3)
10 0 1111 op3 0 0000 i=1 simm13
31 30 29 25 24 19 18 14 13 12 0





Assembly Language Syntax 





sir simm13 


Description 


SIR is used to generate a software-initiated reset (SIR). As with other traps, a software- 
initiated reset performs different actions when TL = MAXTL than it does when TL < MAXTL. 


When executed in non-privileged mode, SIR acts like a NOP with no visible effect. 


Exceptions 


software_initiated_reset

A.60 Store Floating-Point

Opcode op3 rd Operation
STF 10 0100 0-31 Store Floating-Point Register
STDF 10 0111 † Store Double Floating-Point Register
STQF 10 0110 † Store Quad Floating-Point Register
STXFSR 10 0101 1 Store Floating-Point State Register
— 10 0101 2-31 Reserved

† Encoded floating-point register value.

Format (3)
11 rd op3 rs1 i=0 — rs2
11 rd op3 rs1 i=1 simm13
31 30 29 25 24 19 18 14 13 12 5 4 0


Assembly Language Syntax 


st freg_rd, [address]
std freg_rd, [address]
stq freg_rd, [address]
stx %fsr, [address]

Description


The store single floating-point instruction (STF) copies f[rd] into memory.

The store double floating-point instruction (STDF) copies a doubleword from a double
floating-point register into a word-aligned doubleword in memory.

The store quad floating-point instruction (STQF) traps to software.
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The store floating-point state register instruction (STXFSR) waits for any currently executing 
FPop instructions to complete, and then it writes all 64 bits of the FSR into memory. 


STXFSR zeroes FSR.ftt after writing the FSR to memory.





Implementation Note — FSR.ftt should not be zeroed until it is known that the store
will not cause a precise trap.





The effective address for these instructions is “r[rs1] + r[rs2]” if i = 0, or
“r[rs1] + sign_ext(simm13)” if i = 1.


STF requires word alignment; otherwise, a mem_address_not_aligned exception occurs.

The STDF instruction causes an STDF_mem_address_not_aligned trap if the effective address is
32-bit aligned but not 64-bit (doubleword) aligned. In this case, the trap handler software
shall emulate the STDF instruction and return.
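
The STDF alignment cases can be sketched as a classification of the effective address. In this C sketch the names `stdf_align` and `stdf_alignment` are hypothetical. Treating an address that is not even word aligned as mem_address_not_aligned is an assumption of the sketch, based on that exception appearing in the Exceptions list; the text above describes only the word-aligned but not doubleword-aligned case explicitly.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    STDF_ALIGNED,                      /* doubleword aligned: no trap */
    STDF_MEM_ADDRESS_NOT_ALIGNED,      /* word aligned only: emulated */
    MEM_ADDRESS_NOT_ALIGNED            /* assumed: not word aligned   */
} stdf_align;

/* Classifies an STDF effective address by its three low-order bits. */
stdf_align stdf_alignment(uint64_t effective_address)
{
    if ((effective_address & 7u) == 0)
        return STDF_ALIGNED;
    if ((effective_address & 3u) == 0)
        return STDF_MEM_ADDRESS_NOT_ALIGNED;
    return MEM_ADDRESS_NOT_ALIGNED;
}
```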





STXFSR requires doubleword alignment; otherwise, it causes a mem_address_not_aligned 
exception. In this case, the trap handler software shall emulate the STXFSR instruction and 
return. 





If the floating-point unit is not enabled for the source register rd (per FPRS.FEF and
PSTATE.PEF) or if the FPU is not present, then a store floating-point instruction causes an
fp_disabled exception.














Programming Note — In SPARC-V8, some compilers issued sets of single-precision 
stores when they could not determine that doubleword or quadword operands were properly 
aligned. For SPARC-V9, since emulation of misaligned stores is expected to be fast, it is 
recommended that compilers issue sets of single-precision stores only when they can 
determine that doubleword or quadword operands are not properly aligned. 


Exceptions 


illegal_instruction (op3 = 25₁₆ and rd = 2-31)
fp_disabled
mem_address_not_aligned
STDF_mem_address_not_aligned (STDF only)
data_access_exception
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection
PA_watchpoint
VA_watchpoint

A.61 Store Floating-Point into Alternate Space

Opcode op3 rd Operation
STFA^PASI 11 0100 0-31 Store Floating-Point Register to Alternate Space
STDFA^PASI 11 0111 † Store Double Floating-Point Register to Alternate Space
STQFA^PASI 11 0110 † Store Quad Floating-Point Register to Alternate Space

† Encoded floating-point register value.

Format (3)
11 rd op3 rs1 i=0 imm_asi rs2
11 rd op3 rs1 i=1 simm13
31 30 29 25 24 19 18 14 13 12 5 4 0


Assembly Language Syntax 





sta freg_rd, [reg_plus_imm] %asi
stda freg_rd, [regaddr] imm_asi
stda freg_rd, [reg_plus_imm] %asi
stqa freg_rd, [regaddr] imm_asi
stqa freg_rd, [reg_plus_imm] %asi

Description


The store single floating-point into alternate space instruction (STFA) copies f[rd] into
memory.

The store double floating-point into alternate space instruction (STDFA) copies a doubleword
from a double floating-point register into a word-aligned doubleword in memory.

The store quad floating-point into alternate space instruction (STQFA) traps to software.

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Store floating-point into alternate space instructions contain the address space 

identifier (ASI) to be used for the store in the imm_asi field if i = 0 or in the ASI register if
i = 1. The access is privileged if bit 7 of the ASI is zero; otherwise, it is not privileged. The
effective address for these instructions is “r[rs1] + r[rs2]” if i = 0, or
“r[rs1] + sign_ext(simm13)” if i = 1.


STFA requires word alignment; otherwise, a mem_address_not_aligned exception occurs.

The STDFA instruction causes an STDF_mem_address_not_aligned trap if the effective address is
32-bit aligned but not 64-bit (doubleword) aligned. In this case, the trap handler software
shall emulate the STDF instruction and return.

STDFA with certain target ASI is defined to be a 64-byte block-store instruction. See
Section A.4, “Block Load and Block Store (VIS I),” for details.





If the floating-point unit is not enabled for the source register rd (per FPRS.FEF and
PSTATE.PEF) or if the FPU is not present, store floating-point into alternate space
instructions cause an fp_disabled exception.

















Implementation Note — This check is not made for STQFA. STFA and STDFA cause a
privileged_action exception if PSTATE.PRIV = 0 and bit 7 of the ASI is zero.








Programming Note — In SPARC-V8, some compilers issued sets of single-precision 
stores when they could not determine that doubleword or quadword operands were properly 
aligned. For SPARC-V9, since emulation of misaligned stores is expected to be fast, 
compilers are recommended to issue sets of single-precision stores only when they can 
determine that doubleword or quadword operands are not properly aligned. 





Exceptions 


illegal_instruction
fp_disabled
mem_address_not_aligned
STDF_mem_address_not_aligned (STDFA only)
privileged_action
data_access_exception
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection
PA_watchpoint
VA_watchpoint
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A.62 Store Integer 


Opcode op3 Operation 
STB 00 0101 Store Byte 
STH 00 0110 Store Halfword 
STW 00 0100 Store Word 
STX 00 1110 Store Extended Word 
Format (3) 
11 rd op3 rs1 i=0 — rs2
11 rd op3 rs1 i=1 simm13 
31 30 29 25 24 19 18 14 13 12 5 4 0 





Assembly Language Syntax 





stb reg_rd, [address] (synonyms: stub, stsb)
sth reg_rd, [address] (synonyms: stuh, stsh)
stw reg_rd, [address] (synonyms: st, stuw, stsw)
stx reg_rd, [address]

Description 


The store integer instructions (except store doubleword) copy the whole extended (64-bit)
integer, the least significant word, the least significant halfword, or the least significant byte of
r[rd] into memory.

The effective address for these instructions is “r[rs1] + r[rs2]” if i = 0, or
“r[rs1] + sign_ext(simm13)” if i = 1.


A successful store (notably, store extended) instruction operates atomically. 
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STH causes a mem_address_not_aligned exception if the effective address is not halfword 
aligned. STW causes a mem_address_not_aligned exception if the effective address is not
word aligned. STX causes a mem_address_not_aligned exception if the effective address is
not doubleword aligned.


Exceptions 


mem_address_not_aligned (all except STB)
data_access_exception
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection
PA_watchpoint
VA_watchpoint








A.63 Store Integer into Alternate Space

Opcode op3 Operation
STBA^PASI 01 0101 Store Byte into Alternate Space
STHA^PASI 01 0110 Store Halfword into Alternate Space
STWA^PASI 01 0100 Store Word into Alternate Space
STXA^PASI 01 1110 Store Extended Word into Alternate Space

Format (3)
11 rd op3 rs1 i=0 imm_asi rs2
11 rd op3 rs1 i=1 simm13
31 30 29 25 24 19 18 14 13 12 5 4 0



Assembly Language Syntax 





stba reg_rd, [regaddr] imm_asi (synonyms: stuba, stsba)
stha reg_rd, [regaddr] imm_asi (synonyms: stuha, stsha)
stwa reg_rd, [regaddr] imm_asi (synonyms: sta, stuwa, stswa)
stxa reg_rd, [regaddr] imm_asi

stba reg_rd, [reg_plus_imm] %asi (synonyms: stuba, stsba)
stha reg_rd, [reg_plus_imm] %asi (synonyms: stuha, stsha)
stwa reg_rd, [reg_plus_imm] %asi (synonyms: sta, stuwa, stswa)
stxa reg_rd, [reg_plus_imm] %asi

Description 


The store integer into alternate space instructions copy the whole extended (64-bit) integer,
the least significant word, the least significant halfword, or the least significant byte of r[rd]
into memory.

Store integer to alternate space instructions contain the address space identifier (ASI) to be
used for the store in the imm_asi field if i = 0, or in the ASI register if i = 1. The access
is privileged if bit 7 of the ASI is zero; otherwise, it is not privileged. The effective address
for these instructions is “r[rs1] + r[rs2]” if i = 0, or “r[rs1] + sign_ext(simm13)”
if i = 1.


A successful store (notably, store extended) instruction operates atomically. 


STHA causes a mem_address_not_aligned exception if the effective address is not halfword
aligned. STWA causes a mem_address_not_aligned exception if the effective address is not
word aligned. STXA causes a mem_address_not_aligned exception if the effective address is
not doubleword aligned.

A store integer into alternate space instruction causes a privileged_action exception if
PSTATE.PRIV = 0 and bit 7 of the ASI is zero.
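
The ASI selection and the privileged_action test can be sketched in C. The function names below (`effective_asi`, `asi_is_privileged`, `store_alt_privileged_action`) are hypothetical and only restate the rules above; the sketch does not model any other ASI-specific checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The ASI comes from the imm_asi field when i = 0,
 * or from the ASI register when i = 1. */
uint8_t effective_asi(int i, uint8_t imm_asi, uint8_t asi_reg)
{
    assert(i == 0 || i == 1);
    return i == 0 ? imm_asi : asi_reg;
}

/* An access is privileged when bit 7 of the ASI is zero. */
bool asi_is_privileged(uint8_t asi)
{
    return (asi & 0x80u) == 0;
}

/* privileged_action: PSTATE.PRIV = 0 and bit 7 of the ASI is zero. */
bool store_alt_privileged_action(bool pstate_priv, uint8_t asi)
{
    return !pstate_priv && asi_is_privileged(asi);
}
```

For example, ASI value D0₁₆ (ASI_FL8_P) has bit 7 set, so a non-privileged store through it does not raise privileged_action under this rule.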








Compatibility Note — The SPARC-V8 STA instruction is renamed STWA in SPARC-V9. 


Exceptions 


privileged_action
mem_address_not_aligned (all except STBA)
data_access_exception
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection
PA_watchpoint
VA_watchpoint








A.64 Subtract

Opcode op3 Operation
SUB 00 0100 Subtract
SUBcc 01 0100 Subtract and modify condition codes
SUBC 00 1100 Subtract with Carry
SUBCcc 01 1100 Subtract with Carry and modify condition codes

Format (3)
10 rd op3 rs1 i=0 — rs2
10 rd op3 rs1 i=1 simm13
31 30 29 25 24 19 18 14 13 12 5 4 0

Assembly Language Syntax

sub reg_rs1, reg_or_imm, reg_rd
subcc reg_rs1, reg_or_imm, reg_rd
subc reg_rs1, reg_or_imm, reg_rd
subccc reg_rs1, reg_or_imm, reg_rd

Description

These instructions compute “r[rs1] - r[rs2]” if i = 0, or
“r[rs1] - sign_ext(simm13)” if i = 1, and write the difference into r[rd].

SUBC and SUBCcc (“subtract with carry”) also subtract the CCR register's 32-bit carry
(icc.c) bit; that is, they compute “r[rs1] - r[rs2] - icc.c” or
“r[rs1] - sign_ext(simm13) - icc.c,” and write the difference into r[rd].


SUBcc and SUBCcc modify the integer condition codes (CCR.icc and CCR.xcc). A 32-bit
overflow (CCR.icc.v) occurs on subtraction if bit 31 (the sign) of the operands differs
and bit 31 (the sign) of the difference differs from r[rs1]<31>. A 64-bit overflow
(CCR.xcc.v) occurs on subtraction if bit 63 (the sign) of the operands differs and bit 63
(the sign) of the difference differs from r[rs1]<63>.
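
The two overflow rules above translate directly into bit tests. The C sketch below is illustrative only: `sub_flags` and `subcc_overflow` are hypothetical names, and the model covers plain subtraction without the carry input used by SUBC and SUBCcc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t diff;   /* 64-bit difference written to r[rd] */
    bool icc_v;      /* 32-bit overflow, CCR.icc.v */
    bool xcc_v;      /* 64-bit overflow, CCR.xcc.v */
} sub_flags;

/* Overflow occurs when the operand signs differ and the sign of the
 * difference differs from the sign of the first operand. */
sub_flags subcc_overflow(uint64_t a, uint64_t b)
{
    sub_flags f;
    f.diff = a - b;
    unsigned a31 = (a >> 31) & 1u, b31 = (b >> 31) & 1u;
    unsigned d31 = (f.diff >> 31) & 1u;
    unsigned a63 = (unsigned)(a >> 63), b63 = (unsigned)(b >> 63);
    unsigned d63 = (unsigned)(f.diff >> 63);
    f.icc_v = (a31 != b31) && (d31 != a31);
    f.xcc_v = (a63 != b63) && (d63 != a63);
    return f;
}
```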


Programming Note — A SUBcc with rd = 0 can be used to effect a signed or unsigned 
integer comparison. 


SUBC and SUBCcc read the 32-bit condition codes’ carry bit (CCR.icc.c), not the 64-bit
condition codes’ carry bit (CCR.xcc.c).


Exceptions 


None 





A.65 Tagged Add 








Opcode op3 Operation 
TADDcc 10 0000 Tagged Add and modify condition codes 
Format (3) 


10 rd op3 rs1 i=0 — rs2
10 rd op3 rs1 i=1 simm13
31 30 29 25 24 19 18 14 13 12 5 4 0





Assembly Language Syntax 





taddcc reg_rs1, reg_or_imm, reg_rd
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Description 


This instruction computes a sum that is “r[rs1] + r[rs2]” if i = 0, or
“r[rs1] + sign_ext(simm13)” if i = 1.

TADDcc modifies the integer condition codes (icc and xcc).

A tag overflow condition occurs if bit 1 or bit 0 of either operand is nonzero or if the addition
generates 32-bit arithmetic overflow (that is, both operands have the same value in bit 31 and
bit 31 of the sum is different).


If a TADDcc causes a tag overflow, the 32-bit overflow bit (CCR. icc.v) is set to one; if 
TADDcc does not cause a tag overflow, CCR. icc.v is set to zero. 


In either case, the remaining integer condition codes (both the other CCR.icc bits and all
the CCR.xcc bits) are also updated as they would be for a normal ADD instruction. In
particular, the setting of the CCR.xcc.v bit is not determined by the tag overflow condition
(tag overflow is used only to set the 32-bit overflow bit). CCR.xcc.v is set based only on
the normal 64-bit arithmetic overflow condition, like a normal 64-bit add.
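
As an illustration, the tag overflow rule for TADDcc can be modeled as follows. This is a sketch under stated assumptions: the function name `taddcc` is hypothetical, operands are unsigned 64-bit integers, and only the two overflow bits are returned.

```python
MASK64 = (1 << 64) - 1


def taddcc(rs1, operand2):
    """Return (sum, icc_v, xcc_v) for a tagged add."""
    total = (rs1 + operand2) & MASK64

    def bit(value, position):
        return (value >> position) & 1

    # Tag check: bit 1 or bit 0 of either operand is nonzero.
    tag_nonzero = ((rs1 | operand2) & 0b11) != 0
    # 32-bit arithmetic overflow: same sign in, different sign out.
    overflow32 = bit(rs1, 31) == bit(operand2, 31) and bit(total, 31) != bit(rs1, 31)
    # 64-bit overflow ignores the tag bits entirely.
    overflow64 = bit(rs1, 63) == bit(operand2, 63) and bit(total, 63) != bit(rs1, 63)
    return total, int(tag_nonzero or overflow32), int(overflow64)
```

For example, adding 4 and 8 (both with clear tag bits) leaves icc.v at zero, while adding 5 and 8 sets icc.v because bit 0 of the first operand is set.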


Exceptions 


None 





A.66 Tagged Subtract 





Opcode op3 Operation 
TSUBcc 10 0001 Tagged Subtract and modify condition codes 
Format (3) 


31 30 29 25 24 19 18 14 13 12 5 4 0 


Assembly Language Syntax 


tsubcc reg_rs1, reg_or_imm, reg_rd

Description

This instruction computes “r[rs1] - r[rs2]” if i = 0, or
“r[rs1] - sign_ext(simm13)” if i = 1.

TSUBcc modifies the integer condition codes (icc and xcc).


A tag overflow condition occurs if bit 1 or bit 0 of either operand is nonzero or if the 
subtraction generates 32-bit arithmetic overflow; that is, the operands have different values in 
bit 31 (the 32-bit sign bit) and the sign of the 32-bit difference in bit 31 differs from bit 31 
of r[rs1]. 


If a TSUBcc causes a tag overflow, the 32-bit overflow bit (CCR.icc.v) is set to one; if 
TSUBcc does not cause a tag overflow, CCR.icc.v is set to zero. 


In either case, the remaining integer condition codes (both the other CCR.icc bits and all
the CCR.xcc bits) are also updated as they would be for a normal subtract instruction. In
particular, the setting of the CCR.xcc.v bit is not determined by the tag overflow condition
(tag overflow is used only to set the 32-bit overflow bit). The CCR.xcc.v setting is based
only on the normal 64-bit arithmetic overflow condition, like a normal 64-bit subtract.
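
The TSUBcc overflow bits can be sketched the same way. The function name `tsubcc` is hypothetical and operands are assumed to be unsigned 64-bit integers; note that the subtract form compares the result sign against r[rs1] only when the operand signs differ.

```python
MASK64 = (1 << 64) - 1


def tsubcc(rs1, operand2):
    """Return (difference, icc_v, xcc_v) for a tagged subtract."""
    diff = (rs1 - operand2) & MASK64

    def overflow(position):
        a = (rs1 >> position) & 1
        b = (operand2 >> position) & 1
        d = (diff >> position) & 1
        # Operand signs differ and the result sign differs from r[rs1].
        return a != b and d != a

    tag_nonzero = ((rs1 | operand2) & 0b11) != 0
    return diff, int(tag_nonzero or overflow(31)), int(overflow(63))
```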


Exceptions 


None 
A.67 Trap on Integer Condition Codes (Tcc)




















Opcode  op3      cond  Operation                                               icc Test
TA      11 1010  1000  Trap Always                                             1
TN      11 1010  0000  Trap Never                                              0
TNE     11 1010  1001  Trap on Not Equal                                       not Z
TE      11 1010  0001  Trap on Equal                                           Z
TG      11 1010  1010  Trap on Greater                                         not (Z or (N xor V))
TLE     11 1010  0010  Trap on Less or Equal                                   Z or (N xor V)
TGE     11 1010  1011  Trap on Greater or Equal                                not (N xor V)
TL      11 1010  0011  Trap on Less                                            N xor V
TGU     11 1010  1100  Trap on Greater Unsigned                                not (C or Z)
TLEU    11 1010  0100  Trap on Less or Equal Unsigned                          C or Z
TCC     11 1010  1101  Trap on Carry Clear (Greater Than or Equal, Unsigned)   not C
TCS     11 1010  0101  Trap on Carry Set (Less Than, Unsigned)                 C
TPOS    11 1010  1110  Trap on Positive or zero                                not N
TNEG    11 1010  0110  Trap on Negative                                        N
TVC     11 1010  1111  Trap on Overflow Clear                                  not V
TVS     11 1010  0111  Trap on Overflow Set                                    V

Format (4) 
10 - cond op3 rs1 i=0 cc1 cc0 - rs2
10 - cond op3 rs1 i=1 cc1 cc0 - sw_trap_#
31 30 29 28 25 24 19 18 14 13 12 11 10 7 6 5 4 0


Assembly Language Syntax 


ta i_or_x_cc, software_trap_number
tn i_or_x_cc, software_trap_number
tne i_or_x_cc, software_trap_number (synonym: tnz)
te i_or_x_cc, software_trap_number (synonym: tz)
tg i_or_x_cc, software_trap_number
tle i_or_x_cc, software_trap_number
tge i_or_x_cc, software_trap_number
tl i_or_x_cc, software_trap_number
tgu i_or_x_cc, software_trap_number
tleu i_or_x_cc, software_trap_number
tcc i_or_x_cc, software_trap_number (synonym: tgeu)
tcs i_or_x_cc, software_trap_number (synonym: tlu)
tpos i_or_x_cc, software_trap_number
tneg i_or_x_cc, software_trap_number
tvc i_or_x_cc, software_trap_number
tvs i_or_x_cc, software_trap_number


Description 


The Tcc instruction evaluates the selected integer condition codes (icc or xcc) according
to the cond field of the instruction, producing either a TRUE or FALSE result. If TRUE and
no higher-priority exceptions or interrupt requests are pending, then a trap_instruction
exception is generated. If FALSE, a trap_instruction exception does not occur and the
instruction behaves like a NOP.
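
The cond encodings in the opcode table follow a regular pattern: the low three bits select a base test, and bit 3 inverts it. The sketch below exploits that observation; the names `BASE_TESTS` and `tcc_condition` are hypothetical, and the flags are passed in as plain 0/1 values for whichever condition code set (icc or xcc) the instruction selects.

```python
# Base tests indexed by cond<2:0>; cond<3> = 1 negates the selected test.
BASE_TESTS = [
    lambda n, z, v, c: False,          # 000: never (TN); negated gives TA
    lambda n, z, v, c: z,              # 001: equal (TE)
    lambda n, z, v, c: z or (n != v),  # 010: less or equal (TLE)
    lambda n, z, v, c: n != v,         # 011: less (TL)
    lambda n, z, v, c: c or z,         # 100: less or equal unsigned (TLEU)
    lambda n, z, v, c: c,              # 101: carry set (TCS)
    lambda n, z, v, c: n,              # 110: negative (TNEG)
    lambda n, z, v, c: v,              # 111: overflow set (TVS)
]


def tcc_condition(cond, n, z, v, c):
    """Evaluate a 4-bit cond field against the N, Z, V, C flags."""
    flags = [bool(flag) for flag in (n, z, v, c)]
    result = BASE_TESTS[cond & 0b111](*flags)
    return (not result) if cond & 0b1000 else result
```

A table-driven form like this keeps the sixteen conditions in one place; an explicit sixteen-entry dictionary would work equally well.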




















The software trap number is specified by the least significant seven bits of
“r[rs1] + r[rs2]” if i = 0, or the least significant seven bits of
“r[rs1] + sw_trap_#” if i = 1.

When i = 1, bits 7 through 10 are reserved and should be supplied as zeroes by software.
When i = 0, bits 5 through 10 are reserved, the most significant 57 bits of
“r[rs1] + r[rs2]” are unused, and both should be supplied as zeroes by software.


Description (Effect on Privileged State) 


If a trap_instruction traps, 256 plus the software trap number is written into TT[TL]. Then
the trap is taken, and the processor performs the normal trap entry procedure. 
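
The trap type computation can be written out as a one-line sketch. The function name `tcc_trap_type` is hypothetical; `operand2` stands for either r[rs2] or sw_trap_#, depending on the i bit.

```python
def tcc_trap_type(rs1, operand2):
    """Return the value written to TT[TL] when a Tcc instruction traps."""
    software_trap_number = (rs1 + operand2) & 0x7F  # least significant 7 bits
    return 256 + software_trap_number
```

Because only seven bits of the sum are used, the resulting trap types fall in the range 256 to 383.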
Programming Note — Tcc can be used to implement breakpointing, tracing, and calls to
supervisor software. It can also be used for runtime checks, such as out-of-range array
indexes, integer overflow, and so on.





Compatibility Note — Tcc is upward compatible with the SPARC-V8 Ticc instruction, 
with one qualification: a Ticc with i = 1 and simm13 < 0 may execute differently on a 
SPARC-V9 processor. Use of the i = 1 form of Ticc is believed to be rare in SPARC-V8 
software, and simm13 < 0 is probably not used at all; therefore, full software
compatibility is expected in practice.





Exceptions 


trap_instruction 
illegal_instruction (cc1 || cc0 = 01₂ or 11₂, or reserved fields nonzero)

















A.68 Write Privileged Register


Opcode op3 Operation 
WRPRP 11 0010 Write Privileged Register 
Format (3) 
rd       Privileged Register
0        TPC
1        TNPC
2        TSTATE
3        TT
4        TICK
5        TBA
6        PSTATE
7        TL
8        PIL
9        CWP
10       CANSAVE
11       CANRESTORE
12       CLEANWIN
13       OTHERWIN
14       WSTATE
15-31    Reserved


Assembly Language Syntax


wrpr reg_rs1, reg_or_imm, %tpc
wrpr reg_rs1, reg_or_imm, %tnpc
wrpr reg_rs1, reg_or_imm, %tstate
wrpr reg_rs1, reg_or_imm, %tt
wrpr reg_rs1, reg_or_imm, %tick
wrpr reg_rs1, reg_or_imm, %tba
wrpr reg_rs1, reg_or_imm, %pstate
wrpr reg_rs1, reg_or_imm, %tl
wrpr reg_rs1, reg_or_imm, %pil
wrpr reg_rs1, reg_or_imm, %cwp
wrpr reg_rs1, reg_or_imm, %cansave
wrpr reg_rs1, reg_or_imm, %canrestore
wrpr reg_rs1, reg_or_imm, %cleanwin
wrpr reg_rs1, reg_or_imm, %otherwin
wrpr reg_rs1, reg_or_imm, %wstate


Description 


This instruction stores the value “r[rs1] xor r[rs2]” if i = 0, or
“r[rs1] xor sign_ext(simm13)” if i = 1, to the writable fields of the specified
privileged state register.





Note — The operation is an exclusive OR. 
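
Because the written value is an exclusive OR of the two source operands rather than a plain copy, a short sketch may help. The helper names `sign_ext13`, `wr_value_reg`, and `wr_value_imm` are hypothetical, and the model ignores which fields of the target register are actually writable.

```python
MASK64 = (1 << 64) - 1


def sign_ext13(simm13):
    """Sign-extend a 13-bit immediate to a 64-bit two's-complement value."""
    simm13 &= 0x1FFF
    if simm13 & 0x1000:
        simm13 -= 0x2000
    return simm13 & MASK64


def wr_value_reg(rs1, rs2):
    """Value presented to the register when i = 0."""
    return (rs1 ^ rs2) & MASK64


def wr_value_imm(rs1, simm13):
    """Value presented to the register when i = 1."""
    return (rs1 ^ sign_ext13(simm13)) & MASK64
```

When one source is zero (for example, register %g0), the other operand passes through unchanged, since 0 xor v = v; that is the usual way to write a plain value.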


The rd field in the instruction determines the privileged register that is written. There are at
least four copies of the TPC, TNPC, TT, and TSTATE registers, one for each trap level. A
write to one of these registers sets the register indexed by the current value in the trap level
register (TL). A write to TPC, TNPC, TT, or TSTATE when the trap level is zero (TL = 0)
causes an illegal_instruction exception.

















A WRPR of TL does not cause a trap or return from trap; it does not alter any other machine 
state. 


Programming Note — A WRPR of TL can be used to read the values of TPC, TNPC, and
TSTATE for any trap level; however, make sure that traps do not occur while the TL register 
is modified. 





The WRPR instruction is a non-delayed write instruction. The instruction immediately
following the WRPR observes any changes to processor state made by the WRPR.


WRPR instructions with rd in the range 15-31 are reserved for future versions of the
architecture; executing a WRPR instruction with rd in that range causes an illegal_instruction
exception.





Implementation Note — Some WRPR instructions could serialize the processor in some 
implementations. 





Exceptions 


privileged_opcode 
illegal_instruction ((rd = 15-31) or ((rd ≤ 3) and (TL = 0)))
A.69 Write State Register











Opcode            op3      rd     Operation
WRYD              11 0000  0      Write Y register; deprecated (see Section A.70.18, “Write Y Register”).
-                 11 0000  1      Reserved, do not access; attempt to access causes an illegal_instruction exception.
WRCCR             11 0000  2      Write Condition Codes Register
WRASI             11 0000  3      Write ASI Register
-                 11 0000  4, 5   Reserved, do not access; attempt to access causes an illegal_instruction exception.
WRFPRS            11 0000  6      Write Floating-Point Registers Status Register
-                 11 0000  7-14   Reserved, do not access; attempt to access causes an illegal_instruction exception.
-                 11 0000  15     Software-initiated reset (see Section A.59, “Software-Initiated Reset”).
WRASR             11 0000  16-31  Write non-SPARC-V9 ASRs
WRPCR (P_PCR)              16     Write Performance Control Registers (PCR)
WRPIC (P_PIC)              17     Write Performance Instrumentation Counters (PIC)
WRDCRP                     18     Write Dispatch Control Register (DCR)
WRGSR                      19     Write Graphic Status Register (GSR)
WRSOFTINT_SETP             20     Set bits of per-processor Soft Interrupt Register
WRSOFTINT_CLRP             21     Clear bits of per-processor Soft Interrupt Register
WRSOFTINTP                 22     Write per-processor Soft Interrupt Register
WRTICK_CMPRP               23     Write Tick Compare Register
WRSTICKP                   24     Write System TICK Register
WRSTICK_CMPRP              25     Write System TICK Compare Register
-                          26-31  Reserved, do not access; attempt to access causes an illegal_instruction exception.
Format (3)


31 30 29 25 24 19 18 14 13 12 5 4 0





Assembly Language Syntax 





wr reg_rs1, reg_or_imm, %ccr
wr reg_rs1, reg_or_imm, %asi
wr reg_rs1, reg_or_imm, %fprs
wr reg_rs1, reg_or_imm, %pcr
wr reg_rs1, reg_or_imm, %pic
wr reg_rs1, reg_or_imm, %dcr
wr reg_rs1, reg_or_imm, %gsr
wr reg_rs1, reg_or_imm, %set_softint
wr reg_rs1, reg_or_imm, %clear_softint
wr reg_rs1, reg_or_imm, %softint
wr reg_rs1, reg_or_imm, %tick_cmpr
wr reg_rs1, reg_or_imm, %sys_tick
wr reg_rs1, reg_or_imm, %sys_tick_cmpr


Description


These instructions store the value “r[rs1] xor r[rs2]” if i = 0, or
“r[rs1] xor sign_ext(simm13)” if i = 1, to the writable fields of the specified state
register.





Note — The operation is an exclusive OR. 


WRASR writes a value to the ancillary state register (ASR) indicated by rd. The operation
performed to generate the value written may be rd dependent or implementation dependent
(see below). A WRASR instruction is indicated by op = 2, rd ≥ 16, and op3 = 30₁₆.


The WRASR opcode for rd = 15, rs1 = 0, and i = 1 is used for the software-initiated
reset (SIR) instruction (see Section A.59, “Software-Initiated Reset”).




The WRCCR, WRFPRS, and WRASI instructions are not delayed-write instructions. The
instruction immediately following a WRCCR, WRFPRS, or WRASI observes the new value of
the CCR, FPRS, or ASI register.


WRFPRS waits for any pending floating-point operations to complete before writing the 
FPRS register. 





WRGSR causes a fp_disabled trap if PSTATE.PEF = 0 or FPRS.FEF = 0.


WRPIC causes a privileged_action exception if PSTATE.PRIV = 0 and PCR.PRIV = 1.


WRPCR causes a privileged_opcode exception due to access privilege violation.


Implementation Note — Ancillary state registers may include, for example, timer, 
counter, diagnostic, self-test, and trap-control registers. 








Compatibility Note — The SPARC-V8 WRIER, WRPSR, WRWIM, and WRTBR instructions 
do not exist in SPARC-V9 because the IER, PSR, TBR, and WIM registers do not exist in 
SPARC-V9. 











Implementation Note — Some WRASR instructions could serialize the processor in some 
implementations. 


Exceptions 


software_initiated_reset (rd = 15, rs1 = 0, and i = 1 only)
privileged_opcode (WRDCR, WRSOFTINT_SET, WRSOFTINT_CLR, WRSOFTINT,
WRTICK_CMPR, WRSTICK, WRSTICK_CMPR,
and WRPCR)
illegal_instruction (WRASR with rd = 1, 4, 5, 7-14, 26-31;
WRASR with rd = 15 and rs1 ≠ 0 or i ≠ 1)
privileged_action (WRPIC with PSTATE.PRIV = 0 and PCR.PRIV = 1)
fp_disabled (WRGSR with PSTATE.PEF = 0 or FPRS.FEF = 0)
A.70 Deprecated Instructions


The following instructions are deprecated; they are provided only for compatibility with 
previous versions of the architecture. They should not be used in new SPARC-V9 software. 
For each deprecated instruction, another instruction is recommended to be used instead. 
Please see TABLE A-2 for the page number at which you can find a description of the 
preferred instruction. 


A.70.1 Branch on Floating-Point Condition Codes (FBfcc) 


The FBfcc instructions are deprecated. Use the FBPfcc instructions instead.



























































Opcode   cond  Operation                                    fcc Test
FBAD     1000  Branch Always                                1
FBND     0000  Branch Never                                 0
FBUD     0111  Branch on Unordered                          U
FBGD     0110  Branch on Greater                            G
FBUGD    0101  Branch on Unordered or Greater               G or U
FBLD     0100  Branch on Less                               L
FBULD    0011  Branch on Unordered or Less                  L or U
FBLGD    0010  Branch on Less or Greater                    L or G
FBNED    0001  Branch on Not Equal                          L or G or U
FBED     1001  Branch on Equal                              E
FBUED    1010  Branch on Unordered or Equal                 E or U
FBGED    1011  Branch on Greater or Equal                   E or G
FBUGED   1100  Branch on Unordered or Greater or Equal      E or G or U
FBLED    1101  Branch on Less or Equal                      E or L
FBULED   1110  Branch on Unordered or Less or Equal         E or L or U
FBOD     1111  Branch on Ordered                            E or L or G
Format (2)


00 a cond 110 disp22
31 30 29 28 25 24 22 21 0





Assembly Language Syntax 








fba{,a} label
fbn{,a} label
fbu{,a} label
fbg{,a} label
fbug{,a} label
fbl{,a} label
fbul{,a} label
fblg{,a} label
fbne{,a} label (synonym: fbnz)
fbe{,a} label (synonym: fbz)
fbue{,a} label
fbge{,a} label
fbuge{,a} label
fble{,a} label
fbule{,a} label
fbo{,a} label


Programming Note — To set the annul bit for FBfcc instructions, append “,a” to the
opcode mnemonic. For example, use “fbl,a label.” In the preceding table, braces around
“,a” signify that “,a” is optional.


Description 


Unconditional and fcc-conditional branches are described below:


Unconditional branches (FBA, FBN) — If its annul field is zero, an FBN (Branch Never)
instruction acts like a NOP. If its annul field is one, the following (delay) instruction is
annulled (not executed) when the FBN is executed. In neither case does a transfer of
control take place.

FBA (Branch Always) causes a PC-relative, delayed control transfer to the address
“PC + (4 × sign_ext(disp22)),” regardless of the value of the floating-point
condition code bits. If the annul field of the branch instruction is one, the delay instruction
is annulled (not executed). If the annul field is zero, the delay instruction is executed.


Fcc-conditional branches — Conditional FBfcc instructions (except FBA and FBN)
evaluate floating-point condition code zero (fcc0) according to the cond field of the
instruction. Such evaluation produces either a TRUE or FALSE result. If TRUE, the branch
is taken, that is, the instruction causes a PC-relative, delayed control transfer to the
address “PC + (4 × sign_ext(disp22)).” If FALSE, the branch is not taken.




















If a conditional branch is taken, the delay instruction is always executed, regardless of the 
value of the annul field. If a conditional branch is not taken and the annul (a) field is one, 
the delay instruction is annulled (not executed). 





Note — The annul bit has a different effect on conditional branches than it does on 
unconditional branches. 
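
The difference can be captured in a small truth function. This is a sketch only: the name `delay_slot_executes` is hypothetical, and the case of an untaken conditional branch with a = 0 (delay instruction executed) is the conventional behavior, assumed here because the text above does not state it explicitly.

```python
def delay_slot_executes(unconditional, taken, annul):
    """Return True if the delay instruction executes.

    unconditional: True for FBA/FBN (or BA/BN), False for conditional forms.
    taken: whether the branch transfers control (ignored when unconditional).
    annul: the a bit of the branch instruction.
    """
    if unconditional:
        # FBA and FBN: the annul bit alone decides.
        return not annul
    # Conditional: a taken branch always executes its delay instruction;
    # an untaken branch executes it only when the annul bit is clear.
    return taken or not annul
```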





Compatibility Note — Unlike SPARC-V8, SPARC-V9 does not require an instruction 
between a floating-point compare operation and a floating-point branch (FBfcc, FBPfcc). 








If FPRS.FEF = 0 or PSTATE.PEF = 0, or if an FPU is not present, the FBfcc instruction
is not executed and instead generates a fp_disabled exception.














Exceptions 


fp_disabled


A.70.2 Branch on Integer Condition Codes (Bicc)


Use the BPcc instructions in place of Bicc instructions. 
Opcode  cond  Operation                                                 icc Test
BAD     1000  Branch Always                                             1
BND     0000  Branch Never                                              0
BNED    1001  Branch on Not Equal                                       not Z
BED     0001  Branch on Equal                                           Z
BGD     1010  Branch on Greater                                         not (Z or (N xor V))
BLED    0010  Branch on Less or Equal                                   Z or (N xor V)
BGED    1011  Branch on Greater or Equal                                not (N xor V)
BLD     0011  Branch on Less                                            N xor V
BGUD    1100  Branch on Greater Unsigned                                not (C or Z)
BLEUD   0100  Branch on Less or Equal Unsigned                          C or Z
BCCD    1101  Branch on Carry Clear (Greater Than or Equal, Unsigned)   not C
BCSD    0101  Branch on Carry Set (Less Than, Unsigned)                 C
BPOSD   1110  Branch on Positive                                        not N
BNEGD   0110  Branch on Negative                                        N
BVCD    1111  Branch on Overflow Clear                                  not V
BVSD    0111  Branch on Overflow Set                                    V

Format (2)

00 a cond 010 disp22
31 30 29 28 25 24 22 21 0


Assembly Language Syntax





ba{,a} label
bn{,a} label
bne{,a} label (synonym: bnz)
be{,a} label (synonym: bz)
bg{,a} label
ble{,a} label
bge{,a} label
bl{,a} label
bgu{,a} label
bleu{,a} label
bcc{,a} label (synonym: bgeu)
bcs{,a} label (synonym: blu)
bpos{,a} label
bneg{,a} label
bvc{,a} label
bvs{,a} label








Programming Note — To set the annul bit for Bicc instructions, append “,a” to the
opcode mnemonic. For example, use “bgu,a label.” In the preceding table, braces signify
that the “,a” is optional.





Description 


Unconditional branches and icc-conditional branches are described below: 


Unconditional branches (BA, BN) — If its annul field is zero, a BN (Branch Never) 
instruction is treated as a NOP. If its annul field is one, the following (delay) instruction is 
annulled (not executed). In neither case does a transfer of control take place. 


BA (Branch Always) causes an unconditional PC-relative, delayed control transfer to the
address “PC + (4 × sign_ext(disp22)).” If the annul field of the branch instruction is
one, the delay instruction is annulled (not executed). If the annul field is zero, the delay
instruction is executed.


Icc-conditional branches — Conditional Bicc instructions (all except BA and BN)
evaluate the 32-bit integer condition codes (icc), according to the cond field of the
instruction, producing either a TRUE or FALSE result. If TRUE, the branch is taken, that
is, the instruction causes a PC-relative, delayed control transfer to the address
“PC + (4 × sign_ext(disp22)).” If FALSE, the branch is not taken.


If a conditional branch is taken, the delay instruction is always executed regardless of the
value of the annul field. If a conditional branch is not taken and the annul (a) field is one,
the delay instruction is annulled (not executed).


Note — The annul bit has a different effect on conditional branches than it does on 
unconditional branches. 





Exceptions 


None 


A.70.3 Divide (64-bit / 32-bit) 


The UDIV, UDIVcc, SDIV, and SDIVcc instructions are deprecated. Use the UDIVX and 
SDIVX instructions instead. 





Opcode    op3      Operation
UDIVD     00 1110  Unsigned Integer Divide
SDIVD     00 1111  Signed Integer Divide
UDIVccD   01 1110  Unsigned Integer Divide and modify condition codes
SDIVccD   01 1111  Signed Integer Divide and modify condition codes

Format (3)


31 30 29 25 24 19 18 14 13 12 5 4 0


Assembly Language Syntax


udiv reg_rs1, reg_or_imm, reg_rd
sdiv reg_rs1, reg_or_imm, reg_rd
udivcc reg_rs1, reg_or_imm, reg_rd
sdivcc reg_rs1, reg_or_imm, reg_rd

Description 


The divide instructions perform 64-bit by 32-bit division, producing a 32-bit result. If i = 0,
they compute “(Y || r[rs1]<31:0>) ÷ r[rs2]<31:0>.” Otherwise (that is, if i = 1), the
divide instructions compute “(Y || r[rs1]<31:0>) ÷ (sign_ext(simm13)<31:0>).” In
either case, if overflow does not occur, the less significant 32 bits of the integer quotient are
sign-extended or zero-extended to 64 bits and are written into r[rd].


The contents of the Y register are undefined after any 64-bit by 32-bit integer divide
operation.


Unsigned Divide 


Unsigned divide (UDIV, UDIVcc) assumes an unsigned integer doubleword dividend
(Y || r[rs1]<31:0>) and an unsigned integer word divisor r[rs2]<31:0> or
(sign_ext(simm13)<31:0>) and computes an unsigned integer word quotient (r[rd]).
Immediate values in simm13 are in the ranges 0 to 2^12 - 1 and 2^32 - 2^12 to 2^32 - 1 for
unsigned divide instructions.


Unsigned division rounds an inexact rational quotient toward zero. 


In the UltraSPARC IIIi processor, LDD is implemented in hardware.


Programming Note — The rational quotient is the infinitely precise result quotient. It 
includes both the integer part and the fractional part of the result. For example, the rational 
quotient of 11/4 = 2.75 (integer part = 2, fractional part = .75). 


The result of an unsigned divide instruction can overflow the less significant 32 bits of the 
destination register r [rd] under certain conditions. When overflow occurs, the largest 
appropriate unsigned integer is returned as the quotient in r [rd]. The condition under 
which overflow occurs and the value returned in r [rd] under this condition are specified in 
TABLE A-16. 


TABLE A-16 UDIV / UDIVcc Overflow Detection and Value Returned


Condition Under Which Overflow Occurs    Value Returned in r[rd]
Rational quotient ≥ 2^32                 2^32 - 1 (0000 0000 FFFF FFFF₁₆)











When no overflow occurs, the 32-bit result is zero-extended to 64 bits and written into 
register r [rd]. 
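
The unsigned divide behavior described above, including the overflow clamp from TABLE A-16, can be modeled with the following sketch. The function name `udiv` is hypothetical; it takes the Y register and the two source operands as unsigned integers and returns the value written to r[rd] together with an overflow indication.

```python
MASK32 = 0xFFFFFFFF


def udiv(y, rs1, operand2):
    """Model UDIV: (Y || rs1<31:0>) divided by operand2<31:0>."""
    dividend = ((y & MASK32) << 32) | (rs1 & MASK32)
    divisor = operand2 & MASK32
    if divisor == 0:
        raise ZeroDivisionError("division_by_zero")
    quotient = dividend // divisor      # rounds toward zero for unsigned values
    overflow = quotient > MASK32        # rational quotient >= 2^32
    # The 32-bit result is zero-extended, so no further masking is needed.
    return (MASK32 if overflow else quotient), overflow
```

For instance, with Y = 0 the operation is an ordinary 32-bit divide, while Y = 1 and a divisor of 1 overflows and returns 0000 0000 FFFF FFFF₁₆.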


UDIV does not affect the condition code bits. UDIVcc writes the integer condition code bits 
as shown in the following table. Note that negative (N) and zero (Z) are set according to the 
value of r [rd] after it has been set to reflect overflow, if any. 




















Bit     UDIVcc
icc.N   Set if r[rd]<31> = 1
icc.Z   Set if r[rd]<31:0> = 0
icc.V   Set if overflow (per TABLE A-16)
icc.C   Zero
xcc.N   Set if r[rd]<63> = 1
xcc.Z   Set if r[rd]<63:0> = 0
xcc.V   Zero
xcc.C   Zero

Signed Divide 


Signed divide (SDIV, SDIVcc) assumes a signed integer doubleword dividend
(Y || lower 32 bits of r[rs1]) and a signed integer word divisor (lower 32 bits of r[rs2]
or lower 32 bits of sign_ext(simm13)) and computes a signed integer word quotient
(r[rd]).














Signed division rounds an inexact quotient toward zero. For example, -7 ÷ 4 equals the
rational quotient of -1.75, which rounds to -1 (not -2) when rounding toward zero.
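
A short model makes the rounding and clamping behavior concrete. The names `to_signed` and `sdiv` are hypothetical; the clamp limits follow the signed overflow rules of this section (TABLE A-17), and the result is returned as the 64-bit pattern written to r[rd]. Note that Python's `//` rounds toward negative infinity, so the sketch divides magnitudes and restores the sign.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def sdiv(y, rs1, operand2):
    """Model SDIV: signed (Y || rs1<31:0>) divided by signed operand2<31:0>."""
    dividend = to_signed(((y & MASK32) << 32) | (rs1 & MASK32), 64)
    divisor = to_signed(operand2, 32)
    if divisor == 0:
        raise ZeroDivisionError("division_by_zero")
    quotient = abs(dividend) // abs(divisor)   # truncate toward zero
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient
    # Clamp to the signed 32-bit range on overflow.
    quotient = max(-(1 << 31), min(quotient, (1 << 31) - 1))
    return quotient & MASK64                   # sign-extended to 64 bits
```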
The result of a signed divide can overflow the low-order 32 bits of the destination register
r[rd] under certain conditions. When overflow occurs, the largest appropriate signed
integer is returned as the quotient in r[rd]. The conditions under which overflow occurs
and the value returned in r[rd] under those conditions are specified in TABLE A-17.


TABLE A-17 SDIV / SDIVcc Overflow Detection and Value Returned


Condition Under Which Overflow Occurs    Value Returned in r[rd]
Rational quotient ≥ 2^31                 2^31 - 1 (0000 0000 7FFF FFFF₁₆)
Rational quotient ≤ -2^31 - 1            -2^31 (FFFF FFFF 8000 0000₁₆)


When no overflow occurs, the 32-bit result is sign-extended to 64 bits and written into
register r[rd].


SDIV does not affect the condition code bits. SDIVcc writes the integer condition code bits
as shown in the following table. Note that negative (N) and zero (Z) are set according to the
value of r[rd] after it has been set to reflect overflow, if any.


Bit     SDIVcc
icc.N   Set if r[rd]<31> = 1
icc.Z   Set if r[rd]<31:0> = 0
icc.V   Set if overflow (per TABLE A-17)
icc.C   Zero
xcc.N   Set if r[rd]<63> = 1
xcc.Z   Set if r[rd]<63:0> = 0
xcc.V   Zero
xcc.C   Zero


Exceptions


division_by_zero


A.70.4 Load Floating-Point Status Register


The LDFSR instruction is deprecated. Use the LDXFSR instruction instead. 


Opcode op3 rd Operation 
LDFSRP 10 0001 0 Load Floating-Point State Register Lower 

Format (3)


11 rd op3 rs1 i=0 - rs2
11 rd op3 rs1 i=1 simm13
31 30 29 25 24 19 18 14 13 12 5 4 0


Assembly Language Syntax 





ld [address], %fsr





Description 


The load floating-point state register lower instruction (LDFSR) waits for all FPop
instructions that have not finished execution to complete and then loads a word from memory
into the less significant 32 bits of the FSR. The upper 32 bits of FSR are unaffected by
LDFSR.


LDFSR causes a mem_address_not_aligned exception if the effective memory address is not
word aligned.





Compatibility Note — SPARC-V9 supports two different instructions to load the FSR: the 
SPARC-V8 LDFSR instruction is defined to load only the less significant 32 bits of the FSR, 
whereas LDXFSR allows SPARC-V9 programs to load all 64 bits of the FSR. 





Exceptions 


mem_address_not_aligned
data_access_exception
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection
PA_watchpoint
VA_watchpoint
A.70.5 Load Integer Doubleword


The LDD instruction is deprecated; it is provided only for compatibility with previous 
versions of the architecture. It should not be used in new SPARC-V9 software. Use the LDX 
instruction instead. 


Please refer to Section A.27, “Load Integer” for the current load integer instructions. 











Opcode op3 Operation 
LDDP 00 0011 Load doubleword 
Format (3) 


11 rd op3 rs1 i=0 - rs2
11 rd op3 rs1 i=1 simm13
31 30 29 25 24 19 18 14 13 12 5 4 0


Assembly Language Syntax


ldd [address], reg_rd


Description 


The load doubleword integer instruction (LDD) copies a doubleword from memory into an 
r register pair. The word at the effective memory address is copied into the even r register. 
The word at the effective memory address + 4 is copied into the following odd-numbered 
r register. The upper 32 bits of both the even-numbered and odd-numbered r registers are 
zero-filled. 
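
The register-pair behavior can be sketched as follows for big-endian memory. The function name `ldd` and the list-based register file are hypothetical; the odd-rd check and the rd = 0 case follow the notes for this instruction, and the doubleword alignment check is an assumption carried over from the LDDA description.

```python
import struct


def ldd(regs, memory, address, rd):
    """Model LDD on a 32-entry register list and a bytes-like memory."""
    if rd % 2:
        raise ValueError("illegal_instruction: odd rd")
    if address % 8:
        raise ValueError("mem_address_not_aligned")  # assumed doubleword rule
    # Two big-endian 32-bit words; Python ints are already zero-extended.
    even_word, odd_word = struct.unpack_from(">II", memory, address)
    if rd != 0:
        regs[rd] = even_word        # r[0] always reads as zero
    regs[rd + 1] = odd_word
    return regs
```

A replacement using LDX would instead load all 64 bits into a single register, which is why the pair form is discouraged in new code.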
Notes — A load doubleword with rd = 0 modifies only r[1]. The least significant bit of
the rd field in an LDD instruction is unused and should be set to zero by software. An
attempt to execute a load doubleword instruction that refers to a misaligned (odd-numbered)
destination register causes an illegal_instruction exception.


With respect to little-endian memory, an LDD instruction behaves as if it is composed of two
32-bit loads, each of which is byte swapped independently before being written into each
destination register.


Load integer doubleword instructions access the primary address space (ASI = 80₁₆). The
effective address is “r[rs1] + r[rs2]” if i = 0, or “r[rs1] + sign_ext(simm13)”
if i = 1.


A successful load doubleword instruction operates atomically.


Programming Note — LDD is provided for compatibility with SPARC-V8. It may execute
slowly on SPARC-V9 machines because of data path and register-access difficulties.


Exceptions


illegal_instruction (LDD with odd rd)
mem_address_not_aligned
data_access_exception
data_access_error
fast_data_access_MMU_miss
fast_data_access_protection
PA_watchpoint
VA_watchpoint


A.70.6 Load Integer Doubleword from Alternate Space


The LDDA instruction is deprecated. Use the LDXA instruction in its place. 


Please refer to Section A.28, “Load Integer from Alternate Space" for current load integer 
from alternate space instructions. 


Opcode             op3      Operation
LDDA (D, P_ASI)    01 0011  Load Doubleword from Alternate Space


Format (3)


11 rd op3 rs1 i=0 imm_asi rs2
11 rd op3 rs1 i=1 simm13
31 30 29 25 24 19 18 14 13 12 5 4 0


Assembly Language Syntax 





ldda [regaddr] imm_asi, reg_rd
ldda [reg_plus_imm] %asi, reg_rd

Description


The load doubleword integer from alternate space instruction (LDDA) copies a doubleword 
from memory into an r register pair. The word at the effective memory address is copied into 
the even r register. The word at the effective memory address + 4 is copied into the 
following odd-numbered r register. The upper 32 bits of both the even-numbered and
odd-numbered r registers are zero-filled.





Notes — A load doubleword with rd = 0 modifies only r [1]. The least significant bit of 
the rd field in an LDDA instruction is unused and should be set to zero by software. An 
attempt to execute a load doubleword instruction that refers to a misaligned (odd-numbered) 
destination register causes an illegal_instruction exception.


With respect to little-endian memory, an LDDA instruction behaves as if it is composed of 
two 32-bit loads, each of which is byte-swapped independently before being written into 
each destination register. 





The load integer doubleword from alternate space instructions contain the address space
identifier (ASI) to be used for the load in the imm_asi field if i = 0, or in the ASI register
if i = 1. The access is privileged if bit 7 of the ASI is zero; otherwise, it is not privileged.
The effective address for these instructions is “r[rs1] + r[rs2]” if i = 0, or
“r[rs1] + sign_ext(simm13)” if i = 1.


A successful load doubleword instruction operates atomically. 


LDDA causes a mem_address_not_aligned exception if the address is not doubleword aligned.


These instructions cause a privileged_action exception if PSTATE.PRIV = 0 and bit 7 of the
ASI is zero.


In the UltraSPARC IIIi processor, LDDA is implemented in hardware.


LDDA with ASI = 24₁₆ or 2C₁₆ is defined to be a Load Quadword Atomic instruction. See
Section A.29, “Load Quadword, Atomic (VIS I)” for details.





Programming Note — LDDA is provided for compatibility with SPARC-V8. It may 
execute slowly on SPARC-V9 machines because of data path and register-access difficulties. 


If LDDA is emulated in software, an LDXA instruction should be used for the memory access 
in order to preserve atomicity. 





Exceptions 


privileged_action
illegal_instruction (LDDA with odd rd)
mem_address_not_aligned
data_access_exception
fast_data_access_MMU_miss
fast_data_access_protection
PA_watchpoint
VA_watchpoint


A.70.7 Multiply (32-bit)


The UMUL, UMULcc, SMUL, and SMULcc instructions are deprecated. Use the MULX
instruction instead.








Opcode op3 Operation 

UMULD 00 1010 Unsigned Integer Multiply 

SMULD 00 1011 Signed Integer Multiply 

UMULccP 01 1010 Unsigned Integer Multiply and modify condition codes 
SMULccP 01 1011 Signed Integer Multiply and modify condition codes 





Format (3)


31 30 29 25 24 19 18 14 13 12 5 4 0


Assembly Language Syntax 


umul reg_rs1, reg_or_imm, reg_rd
smul reg_rs1, reg_or_imm, reg_rd
umulcc reg_rs1, reg_or_imm, reg_rd
smulcc reg_rs1, reg_or_imm, reg_rd

Description


The multiply instructions perform 32-bit by 32-bit multiplications, producing 64-bit results.
They compute “r[rs1]<31:0> × r[rs2]<31:0>” if i = 0, or
“r[rs1]<31:0> × sign_ext(simm13)<31:0>” if i = 1. They write the 32 most
significant bits of the product into the Y register and all 64 bits of the product into r[rd].


Unsigned multiply instructions (UMUL, UMULcc) operate on unsigned integer word operands 
and compute an unsigned integer doubleword product. Signed multiply instructions (SMUL, 
SMULcc) operate on signed integer word operands and compute a signed integer doubleword 
product. 


UMUL and SMUL do not affect the condition code bits. UMULcc and SMULcc write the 
integer condition code bits, icc and xcc, as shown in TABLE A-18. 


Note — Zero (icc.Z) and 32-bit negative (icc.N) condition codes are set according to the 
less significant word of the product, not according to the full 64-bit result. 





TABLE A-18 UMULcc / SMULcc Condition Code Settings 


Bit      UMULcc / SMULcc 
icc.N    Set if product<31> = 1 
icc.Z    Set if product<31:0> = 0 
icc.V    0 
icc.C    0 
xcc.N    Set if product<63> = 1 
xcc.Z    Set if product<63:0> = 0 
xcc.V    0 
xcc.C    0 





Programming Notes — 32-bit overflow after UMUL/UMULcc is indicated by Y ≠ 0. 


32-bit overflow after SMUL/SMULcc is indicated by Y ≠ (r[rd] >> 31), where “>>” indicates 
32-bit arithmetic right-shift. 
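

The multiply semantics, the condition code settings of TABLE A-18, and the two overflow checks above can be modeled in a few lines of Python. This is a minimal sketch for illustration only: the function names (`umul`, `smul`, `mulcc_flags`, `umul_overflowed_32`, `smul_overflowed_32`) are hypothetical, and register values are represented as plain non-negative Python integers. 

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement word."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def umul(rs1, operand2):
    """Model UMUL: return (r[rd], Y) for an unsigned 32 x 32 multiply."""
    product = (rs1 & MASK32) * (operand2 & MASK32)
    return product, product >> 32


def smul(rs1, operand2):
    """Model SMUL: return (r[rd], Y) for a signed 32 x 32 multiply."""
    product = (to_signed32(rs1) * to_signed32(operand2)) & MASK64
    return product, product >> 32


def mulcc_flags(rd):
    """Condition codes written by UMULcc/SMULcc, following TABLE A-18."""
    return {
        "icc.N": (rd >> 31) & 1,
        "icc.Z": int((rd & MASK32) == 0),
        "icc.V": 0,
        "icc.C": 0,
        "xcc.N": (rd >> 63) & 1,
        "xcc.Z": int(rd == 0),
        "xcc.V": 0,
        "xcc.C": 0,
    }


def umul_overflowed_32(y):
    """After UMUL/UMULcc the product needs more than 32 bits if Y != 0."""
    return y != 0


def smul_overflowed_32(rd, y):
    """After SMUL/SMULcc: overflow if Y != (r[rd] >> 31), arithmetic shift."""
    sign_fill = (to_signed32(rd) >> 31) & MASK32
    return y != sign_fill
```

For example, `smul(0x7FFFFFFF, 2)` leaves Y = 0 while the low word of r[rd] is negative, so `smul_overflowed_32` reports that the product does not fit in a signed 32-bit word. 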


Exceptions 


None 


A.70.8 Multiply Step 


The MULScc instruction is deprecated. Use the MULX instruction instead. 


Opcode      op3       Operation 
MULScc^D    10 0100   Multiply Step and modify condition codes 


Format (3) 


10 rd op3 rs1 i=0 — rs2 
10 rd op3 rs1 i=1 simm13 
31 30 29 25 24 19 18 14 13 12 5 4 0 





Assembly Language Syntax 





mulscc reg_rs1, reg_or_imm, reg_rd 


Description 


MULScc treats the less significant 32 bits of both r[rs1] and the Y register as a single 64-bit, 
right-shiftable doubleword register. The least significant bit of r[rs1] is treated as if it 
were adjacent to bit 31 of the Y register. The MULScc instruction adds, based on the least 
significant bit of Y. 


Multiplication assumes that the Y register initially contains the multiplier, r[rs1] contains 
the most significant bits of the product, and r[rs2] contains the multiplicand. Upon 
completion of the multiplication, the Y register contains the least significant bits of the 
product. 





Note — A standard MULScc instruction has rs1 = rd. 


MULScc operates as follows: 
1. The multiplicand is r[rs2] if i = 0, or sign_ext(simm13) if i = 1. 


2. A 32-bit value is computed by shifting r[rs1] right by one bit with 
“CCR.icc.n xor CCR.icc.v” replacing bit 31 of r[rs1]. (This is the proper sign for 
the previous partial product.) 


3. If the least significant bit of Y = 1, the shifted value from step (2) and the multiplicand are 
added. If the least significant bit of Y = 0, then zero is added to the shifted value from 
step (2). 


4. The sum from step (3) is written into r[rd]. The upper 32 bits of r[rd] are undefined. 
The integer condition codes are updated according to the addition performed in step (3). 
The values of the extended condition codes are undefined. 


5. The Y register is shifted right by one bit, with the least significant bit of the unshifted 
r[rs1] replacing bit 31 of Y. 
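

The five steps above can be sketched as a small Python model. The names `mulscc_step` and `multiply_via_mulscc` are hypothetical. The driver loop (32 steps followed by one extra step with a zero operand to perform the final shift) is an assumption based on the conventional SPARC-V8 multiply sequence, not something this manual specifies, and the sketch is only claimed to be valid for a multiplicand below 2^31, since no sign correction is applied. 

```python
MASK32 = 0xFFFFFFFF


def mulscc_step(rs1, operand2, y, icc_n, icc_v):
    """One MULScc step. Returns (r[rd], new Y, icc flags of the addition)."""
    rs1 &= MASK32
    multiplicand = operand2 & MASK32                    # step 1
    shifted = (rs1 >> 1) | ((icc_n ^ icc_v) << 31)      # step 2
    addend = multiplicand if (y & 1) else 0             # step 3
    total = shifted + addend
    rd = total & MASK32                                 # step 4
    icc = {
        "N": rd >> 31,
        "Z": int(rd == 0),
        # Signed overflow: operands agree in bit 31, result differs.
        "V": ((~(shifted ^ addend) & (shifted ^ rd)) >> 31) & 1,
        "C": total >> 32,
    }
    new_y = ((y & MASK32) >> 1) | ((rs1 & 1) << 31)     # step 5
    return rd, new_y, icc


def multiply_via_mulscc(multiplicand, multiplier):
    """Assumed driver: Y holds the multiplier, r[rs1] = r[rd] starts at 0."""
    acc, y = 0, multiplier & MASK32
    n = v = 0                                           # icc.n and icc.v cleared
    for _ in range(32):
        acc, y, icc = mulscc_step(acc, multiplicand, y, n, v)
        n, v = icc["N"], icc["V"]
    # Final step with a zero operand only shifts the partial product.
    acc, y, _ = mulscc_step(acc, 0, y, n, v)
    return (acc << 32) | y
```

In this model the high word of the product ends up in the accumulator (r[rd]) and the low word in Y, which matches the register roles given in the Description. 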


Exceptions 


None 


A.70.9 Read Y Register 


The RDY instruction from the Read State Register instructions (Section A.50, “Read State 
Register”) is deprecated. It is recommended that all instructions that reference the Y register 
be avoided. 








Opcode   op3       rs1   Operation 
RDY^D    10 1000   0     Read Y Register 


Format (3) 


10 rd op3 rs1 i=0 — 
31 30 29 25 24 19 18 14 13 12 0 


Assembly Language Syntax 


rd %y, reg_rd 


Description 


This instruction reads the Y register into r[rd]. 


Exceptions 


None 


UltraSPARC IIIi Processor User's Manual * June 2003 


A.70.10 Store Barrier 


The STBAR instruction is deprecated. Use the MEMBAR instruction instead. 











Opcode    op3       Operation 
STBAR^D   10 1000   Store Barrier 


Format (3) 


10 0 op3 0 1111 0 — 
31 30 29 25 24 19 18 14 13 12 0 


Assembly Language Syntax 
stbar 





Description 


The store barrier instruction (STBAR) forces all store and atomic load-store operations issued 
by a processor prior to the STBAR to complete their effects on memory before any store or 
atomic load-store operations issued by that processor subsequent to the STBAR are executed 
by memory. 


Note — The encoding of STBAR is identical to that of the RDASR instruction except that 
rs1 = 15 and rd = 0, and it is identical to that of the MEMBAR instruction except that bit 13 
(i) = 0. 








Compatibility Note — STBAR is identical in function to a MEMBAR instruction with 
mmask = 8₁₆. STBAR is retained for compatibility with SPARC-V8. 
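

The encoding relationship described in the two notes above can be checked with a short Python sketch. The helper `encode_format3` is hypothetical. The placement of mmask in bits 3:0 of the MEMBAR instruction word is taken from the SPARC-V9 MEMBAR definition rather than from this section. 

```python
def encode_format3(op, rd, op3, rs1, i, low13):
    """Pack the Format (3) fields into a 32-bit instruction word."""
    return ((op << 30) | (rd << 25) | (op3 << 19)
            | (rs1 << 14) | (i << 13) | (low13 & 0x1FFF))


OP3_RDASR = 0b101000  # op3 shared by RDASR, STBAR, and MEMBAR

# STBAR: rd = 0, rs1 = 15, i = 0, remaining low bits zero.
STBAR = encode_format3(0b10, 0, OP3_RDASR, 15, 0, 0)

# Functionally equivalent MEMBAR: i = 1 and mmask = 8 in bits 3:0.
MEMBAR_MMASK_8 = encode_format3(0b10, 0, OP3_RDASR, 15, 1, 0x8)

print(hex(STBAR), hex(MEMBAR_MMASK_8))
```

The two words differ only in bit 13 (i) and in the mmask field. 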


Implementation Note — For correctness, it is sufficient for a processor to stop issuing 
new store and atomic load-store operations when an STBAR is encountered and to resume 
after all stores have completed and are observed in memory by all processors. More efficient 
implementations may take advantage of the fact that the processor is allowed to issue store 
and load-store operations after the STBAR, as long as those operations are guaranteed not to 
become visible before all the earlier stores and atomic load-stores have become visible to all 
processors. 


Exceptions 


None 


A.70.11 Store Floating-Point Status Register Lower 


The STFSR instruction is deprecated. Use the STXFSR instruction instead. 








Opcode    op3       rd   Operation 
STFSR^D   10 0101   0    Store Floating-Point State Register Lower 


Format (3) 


11 rd op3 rs1 i=0 — rs2 
11 rd op3 rs1 i=1 simm13 
31 30 29 25 24 19 18 14 13 12 5 4 0 





Assembly Language Syntax 


st %fsr, [address] 





Description 


The store floating-point state register lower instruction (STFSR) waits for any currently 
executing FPop instructions to complete, and then it writes the less significant 32 bits of the 
FSR into memory. 


Compatibility Note — SPARC-V9 needs two store-FSR instructions, since the SPARC-V8 
STFSR instruction is defined to store only 32 bits of the FSR into memory. STXFSR allows 
SPARC-V9 programs to store all 64 bits of the FSR. 


STFSR zeroes FSR.ftt after writing the FSR to memory. 


Implementation Note — FSR.ftt should not be zeroed until it is known that the store 
will not cause a precise trap. 


The effective address for this instruction is “r[rs1] + r[rs2]” if i = 0, or 
“r[rs1] + sign_ext(simm13)” if i = 1. 


STFSR causes a mem_address_not_aligned exception if the effective memory address is not 
word aligned. 


Exceptions 


illegal_instruction (op3 = 25₁₆ and rd = 2-31) 
fp_disabled 
mem_address_not_aligned 
data_access_exception 
data_access_error 
fast_data_access_MMU_miss 
fast_data_access_protection 
PA_watchpoint 
VA_watchpoint 


A.70.12 Store Integer Doubleword 


The STD instruction is deprecated. Use the STX instruction instead. 


Opcode   op3       Operation 
STD^D    00 0111   Store Doubleword 


Format (3) 


11 rd op3 rs1 i=0 — rs2 
11 rd op3 rs1 i=1 simm13 
31 30 29 25 24 19 18 14 13 12 5 4 0 





Assembly Language Syntax 





std reg_rd, [address] 


Description 


The store doubleword integer instruction (STD) copies two words from an r register pair into 
memory. The least significant 32 bits of the even-numbered r register are written into 
memory at the effective address, and the least significant 32 bits of the following odd- 
numbered r register are written into memory at the “effective address + 4.” The least 
significant bit of the rd field of a store doubleword instruction is unused and should always 
be set to zero by software. An attempt to execute a store doubleword instruction that refers to 
a misaligned (odd-numbered) rd causes an illegal_instruction exception. 


The effective address for this instruction is “r[rs1] + r[rs2]” if i = 0, or 
“r[rs1] + sign_ext(simm13)” if i = 1. 


A successful store doubleword instruction operates atomically. 


STD causes a mem_address_not_aligned exception if the effective address is not doubleword 
aligned. 


In the UltraSPARC IIIi processor, STD is implemented in hardware. 


Programming Notes — STD is provided for compatibility with SPARC-V8. It may 
execute slowly on SPARC-V9 machines because of data path and register-access difficulties. 
Therefore, programmers should avoid using STD. 


If STD is emulated in software, STX should be used to preserve atomicity. 


With respect to little-endian memory, a STD instruction behaves as if it is composed of two 
32-bit stores, each of which is byte-swapped independently before being written into each 
destination memory word. 
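

The byte layout described above can be illustrated with Python's `struct` module. This is a sketch, not a description of the hardware data path: `std_bytes` is a hypothetical helper that returns the eight bytes STD would place in memory, and the 8-byte little-endian store shown for contrast is an assumption about how a single 64-bit store would lay out the same data. 

```python
import struct

MASK32 = 0xFFFFFFFF


def std_bytes(r_even, r_odd, little_endian=False):
    """Bytes written at the effective address by STD (sketch).

    Word order is the same in both modes; in little-endian mode each
    32-bit word is byte-swapped independently.
    """
    fmt = "<II" if little_endian else ">II"
    return struct.pack(fmt, r_even & MASK32, r_odd & MASK32)


big = std_bytes(0x11223344, 0x55667788)
little = std_bytes(0x11223344, 0x55667788, little_endian=True)

# For contrast only: one 64-bit little-endian store of the same value.
single_64bit_le = struct.pack("<Q", 0x1122334455667788)

print(big.hex(), little.hex(), single_64bit_le.hex())
```

The little-endian STD result keeps the even register's word at the lower address, whereas the single 64-bit little-endian store reverses all eight bytes. 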


Exceptions 


illegal_instruction (STD with odd rd) 
mem_address_not_aligned (all except STB) 
data_access_exception 
data_access_error 
fast_data_access_MMU_miss 
fast_data_access_protection 
PA_watchpoint 
VA_watchpoint 





A.70.13 Store Integer Doubleword into Alternate Space 


The STDA instruction is deprecated. Instead, use the STXA instruction. 





Opcode              op3       Operation 
STDA^(D, P_ASI)     01 0111   Store Doubleword into Alternate Space 


Format (3) 


11 rd op3 rs1 i=0 imm_asi rs2 
11 rd op3 rs1 i=1 simm13 
31 30 29 25 24 19 18 14 13 12 5 4 0 


Assembly Language Syntax 


stda reg_rd, [reg_plus_imm] %asi 





Description 


The store doubleword integer instruction (STDA) copies two words from an r register pair 
into memory. The least significant 32 bits of the even-numbered r register are written into 
memory at the effective address, and the least significant 32 bits of the following odd- 
numbered r register are written into memory at the “effective address + 4.” The least 
significant bit of the rd field of a store doubleword instruction is unused and should always 
be set to zero by software. An attempt to execute a store doubleword instruction that refers to 
a misaligned (odd-numbered) rd causes an illegal_instruction exception. 


Store integer doubleword to alternate space instructions contain the address space identifier 
(ASI) to be used for the store in the imm_asi field if i = 0, or in the ASI register if i = 1. 
The access is privileged if bit 7 of the ASI is zero; otherwise, it is not privileged. The 
effective address for these instructions is “r[rs1] + r[rs2]” if i = 0, or 
“r[rs1] + sign_ext(simm13)” if i = 1. 


A successful store doubleword instruction operates atomically. 


STDA causes a mem_address_not_aligned exception if the effective address is not 
doubleword aligned. 


A store integer into alternate space instruction causes a privileged_action exception if 
PSTATE.PRIV = 0 and bit 7 of the ASI is zero. 





In the UltraSPARC IIIi processor, STDA is implemented in hardware. 


Programming Note — STDA is provided for compatibility with SPARC-V8. It may 
execute slowly on SPARC-V9 machines because of data path and register-access difficulties. 
Therefore, programmers should avoid using STDA. 





Exceptions 


illegal_instruction (STDA with odd rd) 
privileged_action 
mem_address_not_aligned 
data_access_exception 
data_access_error 
fast_data_access_MMU_miss 
fast_data_access_protection 
PA_watchpoint 
VA_watchpoint 


A.70.14 Swap Register with Memory 


The SWAP instruction is deprecated. Use the CASA or CASXA instruction in its place. 





Opcode   op3       Operation 
SWAP^D   00 1111   Swap Register with Memory 


Format (3) 


11 rd op3 rs1 i=0 — rs2 
11 rd op3 rs1 i=1 simm13 
31 30 29 25 24 19 18 14 13 12 5 4 0 





Assembly Language Syntax 





swap [address], reg_rd 





Description 


SWAP exchanges the less significant 32 bits of r[rd] with the contents of the word at the 
addressed memory location. The upper 32 bits of r[rd] are set to zero. The operation is 
performed atomically, that is, without allowing intervening interrupts or deferred traps. In a 
multiprocessor system, two or more processors executing CASA, CASXA, SWAP, SWAPA, 
LDSTUB, or LDSTUBA instructions addressing any or all of the same doubleword 
simultaneously are guaranteed to execute them in an undefined, but serial order. 


The effective address for these instructions is “r[rs1] + r[rs2]” if i = 0, or 
“r[rs1] + sign_ext(simm13)” if i = 1. This instruction causes a 
mem_address_not_aligned exception if the effective address is not word aligned. 
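

The data movement performed by SWAP can be sketched in Python as follows. This is a single-threaded illustration that does not model atomicity or address translation: the `swap` function and the `MemAddressNotAligned` class are hypothetical, memory is a `bytearray`, and big-endian byte order is assumed. 

```python
class MemAddressNotAligned(Exception):
    """Stands in for the mem_address_not_aligned trap."""


def swap(memory, address, rd):
    """Exchange the low 32 bits of rd with the word at address.

    Returns the new value of r[rd]; its upper 32 bits are zero.
    """
    if address % 4 != 0:
        raise MemAddressNotAligned(hex(address))
    old_word = int.from_bytes(memory[address:address + 4], "big")
    memory[address:address + 4] = (rd & 0xFFFFFFFF).to_bytes(4, "big")
    return old_word


memory = bytearray(8)
memory[4:8] = bytes.fromhex("deadbeef")
new_rd = swap(memory, 4, 0xFFFFFFFF00000001)
print(hex(new_rd), memory.hex())
```

Only the low word of the register takes part in the exchange, so the upper half of the original r[rd] value is discarded. 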


The coherence and atomicity of memory operations between processors and I/O DMA 
memory accesses are implementation dependent. 





Implementation Note — See Implementation Characteristics of Current SPARC-V9- 
based Products, Revision 9.x, a document available from SPARC International, for 
information on the presence of hardware support for these instructions in the various 
SPARC-V9 implementations. 


Exceptions 


mem_address_not_aligned 
data_access_exception 
data_access_error 
fast_data_access_MMU_miss 
fast_data_access_protection 
PA_watchpoint 
VA_watchpoint 


A.70.15 Swap Register with Alternate Space Memory 


The SWAPA instruction is deprecated. Use the CASXA instruction instead. 








Opcode               op3       Operation 
SWAPA^(D, P_ASI)     01 1111   Swap Register with Alternate Space Memory 


Format (3) 


11 rd op3 rs1 i=0 imm_asi rs2 
11 rd op3 rs1 i=1 simm13 
31 30 29 25 24 19 18 14 13 12 5 4 0 


Assembly Language Syntax 





swapa [regaddr] imm_asi, reg_rd 
swapa [reg_plus_imm] %asi, reg_rd 


Description 


SWAPA exchanges the less significant 32 bits of r[rd] with the contents of the word at the 
addressed memory location. The upper 32 bits of r[rd] are set to zero. The operation is 
performed atomically, that is, without allowing intervening interrupts or deferred traps. In a 
multiprocessor system, two or more processors executing CASA, CASXA, SWAP, SWAPA, 
LDSTUB, or LDSTUBA instructions addressing any or all of the same doubleword 
simultaneously are guaranteed to execute them in an undefined, but serial order. 


The SWAPA instruction contains the address space identifier (ASI) to be used for the load in 
the imm_asi field if i = 0, or in the ASI register if i = 1. The access is privileged if bit 7 
of the ASI is zero; otherwise, it is not privileged. The effective address for this instruction is 
“r[rs1] + r[rs2]” if i = 0, or “r[rs1] + sign_ext(simm13)” if i = 1. 
This instruction causes a mem_address_not_aligned exception if the effective address is not 
word aligned. It causes a privileged_action exception if PSTATE.PRIV = 0 and bit 7 of the 
ASI is zero. 





The coherence and atomicity of memory operations between processors and I/O DMA 
memory accesses are implementation dependent. 


Implementation Note — See Implementation Characteristics of Current SPARC-V9- 
based Products, Revision 9.x, a document available from SPARC International, for 
information on the presence of hardware support for this instruction in the various 
SPARC-V9 implementations. 


Exceptions 


mem_address_not_aligned 
privileged_action 
data_access_exception 
data_access_error 
fast_data_access_MMU_miss 
fast_data_access_protection 
PA_watchpoint 
VA_watchpoint 





A.70.16 Tagged Add and Trap on Overflow 


The TADDccTV instruction is deprecated. Use the TADDcc instruction followed by BPVS 
(with instructions to save the pre-TADDcc integer condition codes if necessary). 


Opcode       op3       Operation 
TADDccTV^D   10 0010   Tagged Add and modify condition codes, or Trap on Overflow 


Format (3) 


10 rd op3 rs1 i=0 — rs2 
10 rd op3 rs1 i=1 simm13 
31 30 29 25 24 19 18 14 13 12 5 4 0 


Assembly Language Syntax 


taddcctv reg_rs1, reg_or_imm, reg_rd 


Description 


This instruction computes a sum that is “r[rs1] + r[rs2]” if i = 0, or 
“r[rs1] + sign_ext(simm13)” if i = 1. 


TADDccTV modifies the integer condition codes if it does not trap. 


A tag_overflow exception occurs if bit 1 or bit 0 of either operand is nonzero or if the 
addition generates 32-bit arithmetic overflow (that is, both operands have the same value in 
bit 31 and bit 31 of the sum is different). 


If TADDccTV causes a tag overflow, a tag_overflow exception is generated and r[rd] and 
the integer condition codes remain unchanged. If a TADDccTV does not cause a tag overflow, 
the sum is written into r[rd] and the integer condition codes are updated. CCR.icc.v is 
set to zero to indicate no 32-bit overflow. 


In either case, the remaining integer condition codes (both the other CCR.icc bits and all 
the CCR.xcc bits) are also updated as they would be for a normal ADD instruction. In 
particular, the setting of the CCR.xcc.v bit is not determined by the tag overflow condition 
(tag overflow is used only to set the 32-bit overflow bit). CCR.xcc.v is set based only on 
the normal 64-bit arithmetic overflow condition, like a normal 64-bit add. 
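

The trap condition and the condition code updates for the non-trapping case can be modeled as follows. This is a minimal sketch with hypothetical names (`taddcctv`, `TagOverflow`): operands are 64-bit values held in Python integers, and a Python exception stands in for the tag_overflow trap. 

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


class TagOverflow(Exception):
    """Stands in for the tag_overflow trap."""


def taddcctv(rs1, operand2):
    """Return (r[rd], condition codes), or raise TagOverflow."""
    a, b = rs1 & MASK64, operand2 & MASK64
    total = a + b
    result = total & MASK64
    # Bit 1 or bit 0 of either operand is nonzero.
    tag_bits_set = ((a | b) & 0x3) != 0
    # Operands agree in bit 31 but bit 31 of the sum differs.
    overflow32 = ((~(a ^ b) & (a ^ result)) >> 31) & 1
    if tag_bits_set or overflow32:
        raise TagOverflow
    flags = {
        "icc.N": (result >> 31) & 1,
        "icc.Z": int((result & MASK32) == 0),
        "icc.V": 0,  # no 32-bit overflow when the instruction completes
        "icc.C": ((a & MASK32) + (b & MASK32)) >> 32,
        "xcc.N": result >> 63,
        "xcc.Z": int(result == 0),
        # 64-bit overflow is independent of the tag overflow condition.
        "xcc.V": ((~(a ^ b) & (a ^ result)) >> 63) & 1,
        "xcc.C": total >> 64,
    }
    return result, flags
```

With this model, `taddcctv(4, 8)` completes normally, while `taddcctv(5, 8)` (a tag bit is set) and `taddcctv(0x7FFFFFFC, 4)` (32-bit overflow) both raise `TagOverflow`. 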


Compatibility Note — TADDccTV traps based on the 32-bit overflow condition, just as in 
SPARC-V8. Although the tagged add instructions set the 64-bit condition codes CCR.xcc, 
there is no form of the instruction that traps on the 64-bit overflow condition. 


Exceptions 


tag_overflow 


A.70.17 Tagged Subtract and Trap on Overflow 


The TSUBccTV instruction is deprecated. Use the TSUBcc instruction followed by BPVS 
(with instructions to save the pre-TSUBcc integer condition codes if necessary). 
Opcode       op3       Operation 
TSUBccTV^D   10 0011   Tagged Subtract and modify condition codes, or Trap on Overflow 


Format (3) 


10 rd op3 rs1 i=0 — rs2 
10 rd op3 rs1 i=1 simm13 
31 30 29 25 24 19 18 14 13 12 5 4 0 





Assembly Language Syntax 


tsubcctv reg_rs1, reg_or_imm, reg_rd 





Description 


This instruction computes “r[rs1] − r[rs2]” if i = 0, or 
“r[rs1] − sign_ext(simm13)” if i = 1. 


TSUBccTV modifies the integer condition codes (icc and xcc) if it does not trap. 


A tag overflow occurs if bit 1 or bit 0 of either operand is nonzero or if the subtraction 
generates 32-bit arithmetic overflow; that is, the operands have different values in bit 31 (the 
32-bit sign bit) and the sign of the 32-bit difference in bit 31 differs from bit 31 of r[rs1]. 


If TSUBccTV causes a tag overflow, then a tag_overflow exception is generated and r[rd] 
and the integer condition codes remain unchanged. If a TSUBccTV does not cause a tag 
overflow condition, the difference is written into r[rd] and the integer condition codes are 
updated. CCR.icc.v is set to zero to indicate no 32-bit overflow. 


In either case, the remaining integer condition codes (both the other CCR.icc bits and all 
the CCR.xcc bits) are also updated as they would be for a normal subtract instruction. In 
particular, the setting of the CCR.xcc.v bit is not determined by the tag overflow condition 
(tag overflow is used only to set the 32-bit overflow bit). CCR.xcc.v is set based only on 
the normal 64-bit arithmetic overflow condition, like a normal 64-bit subtract. 


Compatibility Note — TSUBccTV traps are based on the 32-bit overflow condition, just 
as in SPARC-V8. Although the tagged-subtract instructions set the 64-bit condition codes 
CCR.xcc, there is no form of the instruction that traps on 64-bit overflow. 


Exceptions 


tag_overflow 


A.70.18 Write Y Register 


The WRY instruction is deprecated. It is recommended that all instructions that reference the 
Y register be avoided. 


Opcode   op3       rd     Operation 
WRY^D    11 0000   0      Write Y register 
—        11 0000   1-31   See Section A.69, “Write State Register” 


Format (3) 


10 rd op3 rs1 i=0 — rs2 
10 rd op3 rs1 i=1 simm13 
31 30 29 25 24 19 18 14 13 12 5 4 0 


Assembly Language Syntax 


wr reg_rs1, reg_or_imm, %y 


Description 


This instruction stores the value “r[rs1] xor r[rs2]” if i = 0, or 
“r[rs1] xor sign_ext(simm13)” if i = 1, to the writable fields of the Y register. 


Note — The operation is an exclusive OR. 


The WRY instruction is not a delayed-write instruction. The instruction immediately 
following a WRY observes the new value of the Y register. 
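

Because the value written is an exclusive OR of the two operands, a plain copy into Y needs one operand to be zero. The Python sketch below illustrates this. The function `wry` is hypothetical, and treating the writable fields of Y as its low 32 bits is an assumption made for the model. 

```python
MASK32 = 0xFFFFFFFF


def wry(rs1, operand2):
    """Value stored into Y by WRY: r[rs1] xor operand2 (low 32 bits assumed)."""
    return (rs1 ^ operand2) & MASK32


# With a zero first operand (for example r[0]), the second operand is copied.
y_copy = wry(0, 0x1234)

# With two nonzero operands the stored value is their exclusive OR.
y_xor = wry(0xFF, 0x0F)

print(hex(y_copy), hex(y_xor))
```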


Exceptions 


None 
integer registers 434, 435 
quadword 137, 137 
word 137, 137 
alternate address space 380 
alternate global registers 76 
alternate globals enable (AG) field of PSTATE register 
76, 110 
alternate space instructions 92 
ancillary state registers (ASRs) 90—?? 
access 94 
number 90 
possible registers included 390, 422 
writing to 421 
AND instruction 335 
ANDcc instruction 335 
ANDN instruction 335 
ANDNcc instruction 335 
annul bit 
in branch instructions 284 
in conditional branches 425 
in control transfer instruction 93 
annulled branches 284 
application program xxxi, 127 
Architectural Register File (ARF) 46 
architecture, meaning for SPARC V9 xxviii 
ARF (Architectural Register File) 46 
arithmetic overflow 91 
ARRAY16 instruction 271 
ARRAY32 instruction 271 
ARRAY8 instruction 271 
ASI 
_BLK_COMMIT_PRIMARY 206 
_BLK_COMMIT_SECONDARY 206 
_NUCLEUS_QUAD_LDD_S 326 
atomic access 192 
nonfaulting 192 
restricted 184 
UltraSPARC III internal 195 
ASI_AS_IF_USER_PRIMARY 191 
ASI_AS_IF_USER_PRIMARY_LITTLE 191 
ASI_AS_IF_USER_SECONDARY 191 
ASI_AS_IF_USER_SECONDARY_LITTLE 191 
ASI_DCU_CONTROL_REGISTER 127 
ASI_INTR_DISPATCH_STATUS 216, 220 
ASI_INTR_DISPATCH_STATUS.BUSY bit 216 
ASI_INTR_DISPATCH_STATUS.NACK bit 216 
ASI_INTR_DISPATCH_W 219 
ASI_INTR_RECEIVE 217, 221 
ASI_INTR_W 216, 219 
ASI_NUCLEUS 191 
ASI_NUCLEUS_LITTLE 191 
ASI_PHYS_USE_EC 191 
ASI_PHYS_USE_EC_LITTLE 191 
ASI_PRIMARY 138, 191 
ASI_PRIMARY_LITTLE 110, 191 
ASI_PRIMARY_NO_FAULT 192 
ASI_PRIMARY_NO_FAULT_LITTLE 192 
ASI_PST16_P 359 
ASI_PST16_PL 359 
ASI_PST16_S 359 
ASI_PST16_SL 359 
ASI_PST32_P 359 
ASI_PST32_PL 360 
ASI_PST32_S 359 
ASI_PST32_SL 360 
ASI_PST8_P 359 
ASI_PST8_PL 359 
ASI_PST8_S 359 
ASI_PST8_SL 359 
ASI_SDB_INTR 218, 221 
ASI_SDB_INTR_R 217 


ASI_SECONDARY 191 
ASI_SECONDARY_LITTLE 191 
ASI_SECONDARY_NO_FAULT 192 
ASI_SECONDARY_NO_FAULT_LITTLE 192 
ASRs 
grouping rules 46 
async_data_error exception 320, 326, 330, 436 
atomic 
load quadword 326 
memory operations 327 
store doubleword instruction 444, 446 
store instructions 408, 410 
atomic instructions 
compare and swap 191 
LDSTUB 191 
mutual exclusion support 191 
and store queue 197 
SWAP 191 
use with ASIs 191 
atomic load-store instructions 292 
compare and swap 291 
load-store unsigned byte 329, 447, 448 
load-store unsigned byte to alternate space 330 
swap r register with alternate space memory 448 
swap r register with memory 292, 446 


B 
B pipeline stage 37 
BA instruction 426, 427 
BCC instruction 426 
BCS instruction 426 
BE instruction 426 
BG instruction 426 
BGE instruction 426 
BGU instruction 426 
Bicc instructions 92, 93, 425 
big-endian 
swapping in partial store instructions 361 
big-endian byte order 110, 136, 137, 177 
bit vector concatenation xxix 
BLE instruction 426 
BLEU instruction 426 
block 
load and store instructions 
compliance across UltraSPARC platforms 339 
data size (granularity) 194 
E-cache access counting 239 
load instruction 194 
grouping 47 
ordering 278 
and store queue 197 
load instructions 81, 203, 206, 274 
operations and memory model 278 
overlapping stores 278 
store instruction 
data size (granularity) 194 
grouping 47 
ordering 278 
and PDIST 45 
use in loops 279 
store instructions 81, 203, 274 
use in loops 279 
BMASK instruction 282 
and BSHUFFLE instruction 283 
and MS pipeline 283 
grouping rules 45 
BN instruction 426, 427 
BNE instruction 426 
BNEG instruction 426 
BPA instruction 288 
BPCC instruction 288 
BPcc instructions 92, 93, 174, 175, 288 
BPCS instruction 288 
BPE instruction 288 
BPG instruction 288 
BPGE instruction 288 
BPGU instruction 288 
BPL instruction 288 
BPLE instruction 288 
BPLEU instruction 288 
BPN instruction 288 
BPNE instruction 288 
BPNEG instruction 288 
BPOS instruction 426 
BPPOS instruction 288 
BPr instructions 93, 174, 175, 283 
BPVC instruction 288 
BPVS instruction 288 
BR pipeline 37 
branch 
annulled 284 
delayed 177 
elimination 140, 141 
fcc-conditional 287, 425 
icc-conditional 427 
prediction bit 284 


unconditional 287, 289, 424, 427 
branch if contents of integer register match condition 
instructions 283 
branch instructions, conditional 39 
branch on floating-point condition codes instructions 423 
branch on floating-point condition codes with prediction 
instructions 285 
branch on integer condition codes instructions, See Bicc 
instructions 
branch on integer condition codes with prediction (BPcc) 
instructions 288 
branch prediction 
in B pipeline stage 37 
mispredict signal 39 
statistics for taken/untaken 234 
Branch Predictor (BP) 36 
break-after, definition 41 
break-before, definition 41 
BRGEZ instruction 283 
BRGZ instruction 283 
BRLEZ instruction 283 
BRLZ instruction 283 
BRNZ instruction 283 
BRZ instruction 283 
BSHUFFLE instruction 282 
and BMASK instruction 283 
fully pipelined 283 
grouping rules 45 
bubble, vs. helper 46 
bubbles 234 
BUSY/NACK pairs 220 
BVC instruction 426 
BVS instruction 426 
byte 
addressing 179 
data format 59 
order 136, 137, 177 
order, big-endian 110, 136 
order, implicit 110 
order, little-endian 110, 136 
byte mask 
grouping 283 
byte ordering 361 


C 
C pipeline stage 39, 40 
cache 
coherency protocol 185 
flushing 205 
level 1 199 
level 2 201 
organization 199 
cacheable accesses 
indication 185 
properties 185 
CALL instruction 
description 290 
destination register 93 
displacement 155 
does not change CWP 80 
and JMPL instruction 318 
writing address into r[15] 76 
CANRESTORE register 114 
CANSAVE register 114 
carry (C) bit of condition fields of CCR 91 
CAS(X)A instruction 191 
CASA instruction 142, 291, 329, 331, 447, 448 
CASXA instruction 142, 291, 329, 331, 447, 448 
cc0 field of instructions 174, 287, 288, 301, 353 
cc1 field of instructions 174, 287, 288, 301, 353 
cc2 field of instructions 174, 353 
CCR, See condition codes (CCR) register 
clean register window 115, 392 
clean windows (CLEANWIN) register 114, 114, 385, 
418 
clean window exception 114, 393, 394 
CLEAR SOFTINT pseudo-register 223 
clock-tick register (TICK) 102, 103, 385, 418 
code 
kernel 222 
nucleus 222 
coherence 
domain 185 
unit of 185 
compare and swap instructions 291 
comparison instruction 144, 412 
complex calculations, fixed data format 71 
concatenation of bit vectors xxix 
cond field of instructions 174, 287, 288, 345, 353, 424, 
427 
condition codes 293 
adding 413 
extended integer (xcc) 92 
floating-point 425 
icc field 91 
integer 90 


results of integer operation (icc) 92 
subtracting 412, 414 
trapping on 416 
xcc field 91 
condition codes (CCR) register 90, 105, 268, 294, 422, 
439 
conditional branch instructions 39 
conditional branches 287, 425, 427 
conditional move instructions 
grouping rules 48 
const22 field of instructions 316 
constants, generating 397 
control and status registers 90 
control-transfer instructions (CTIs) 154, 294 
conventions 
font xxviii 
notational xxix 
conversion 
between floating-point formats instructions 304 
floating-point to integer instructions 302 
integer to floating-point instructions 306 
planar to packed 378 
CTI queue 37 
current exception (cexc) field of FSR register 121, 122, 
123, 124, 125, 126, 147 
current window pointer (CWP) register 
and CALL/JMPL instructions 80 
and clean windows 115 
definition xxxii 
and FLUSHW instruction 315 
function 114 
incremented/decremented 78, 393 
and overlapping windows 78 
range of values 114 
reading CWP with RDPR instruction 385 
and RESTORE instruction 154, 393 
restored during DONE or RETRY 294 
and SAVE instruction 154, 393 
and TSTATE Register 105 
writing CWP with WRPR instruction 418 
current little endian (CLE) field of PSTATE register 
110, 110 
cycles accumulated, count 233 


D 
D pipeline stage 40, 234 
d16hi field of instructions 174, 284 
d16lo field of instructions 174, 284 
data 
formats 
byte 59 
doubleword 59 
extended word 59 
halfword 59 
quadword 59 
tagged word 59 
word 59 
types 
floating-point 59 
signed integer 59 
unsigned integer 59 
width 59 
watchpoint 
behavior 132 
exception 360 
physical address 133 
register format 133 
virtual address 133 
Data Cache 199 
flush 205 
data cache 
and block load/store 277 
Data Cache Unit Control Register, See DCUCR 
data_access_error exception 281, 293, 320, 324, 328, 
330, 331, 361, 402, 405, 407, 409, 410, 434, 443, 
445, 446 
data_access_exception exception 110, 220, 293, 320, 
322, 330, 331, 405, 407, 409, 410, 445, 446, 447, 449 
data_access_exception exception 185, 191, 192, 194 
data_access_protection exception 281, 324, 326, 328, 
361, 402, 434, 436 
DB_PA field of PA Data Watchpoint register 133 
DC_wr 238 
DC_wr_miss 238 
DCR 
branch and return control 95 
fields 
BPE (branch prediction enable) 95 
MS (multiscalar dispatch enable) 96 
RPE (return address prediction enable) 96 
SI (single issue disable) 96 
IFPOE field 96 
instruction dispatch control 96 
layout 95 
DCUCR 
access data format 128 


DC (data cache enable) field 130 
DM (DMMU enable) field 129 
IC (I-cache enable) field 196 
IC (instruction cache enable) field 130 
IMI (IMMU enable) field 129 
PM (PA data watchpoint mask) field 130 
PR/PW (PA watchpoint enable) fields 131 
VM (VA data watchpoint mask) field 131 
VR/VW (VA data watchpoint enable) fields 131 
watchpoint byte masks/enable bits 132 
deferred trap 
queue, floating-point (FQ) 385 
delay instruction 93, 154, 284, 287, 290, 294, 391, 425 
delayed branch 177 
delayed control transfer 93, 284 
deprecated instructions 
BA 426 
BCC 426 
BCS 426 
BE 426 
BG 426 
BGE 426 
BGU 426 
Bicc 425 
BLE 426 
BLEU 426 
BN 426 
BNE 426 
BNEG 426 
BPOS 426 
BVC 426 
BVS 426 
FBA 423 
FBE 423 
FBG 423 
FBGE 423 
FBL 423 
FBLE 423 
FBLG 423 
FBN 423 
FBNE 423 
FBO 423 
FBU 423 
FBUE 423 
FBUGE 423 
FBUL 423 
FBULE 423 
LDD 433 
LDDA 434 
LDFSR 431 

MULScc 90, 438 

RDY 90, 388, 440 

SDIV 90, 428 

SDIVcc 90, 428 

SMUL 90, 436 

SMULcc 90, 436 

STD 443 

STDA 445 

STFSR 442 

SWAP 446 

SWAPA 448 

TSUBccTV 449, 451 

UDIV 90, 428 

UDIVcc 90, 428 

UMUL 90, 436 

UMULcc 90, 436 

WRY 90, 420, 452 
disp19 field of instructions 174, 287, 288 
disp22 field of instructions 175, 424, 427 
disp30 field of instructions 175, 290 
Dispatch_rs_mispred 235 
Dispatch0_2nd_br 235 
Dispatch0_br_target 235 
divide instructions 357, 428 
divide-by-zero mask (DZM) bit of TEM field of FSR 

register 124 
division_by_zero exception 144, 358 
division-by-zero accrued (dza) bit of aexc field of FSR 

register 127 
division-by-zero current (dzc) bit of cexc field of FSR 

register 127 
DONE instruction 92, 109, 294 

after internal store to ASI 196 

and BST 278 

exiting RED state 25, 249 

grouping rules 47 

restoring AG, IG, MG bits 109 

target address 155 

when TSTATE uninitialized 25, 250 
doublet xxxii 
doubleword 

addressing 180 

alignment 137 

data format 59 

definition xxxii 

in memory 76 
D-SFAR register 

exception address (64-bit) 112 


D-TLB 
access 39 


E 
E pipeline stage 38 
EC_ic_miss 240 
EC_misses 239 
E-cache 203 
EDGE16 instruction 295 
EDGE16L instruction 295 
EDGE16LN instruction 295 
EDGE16N instruction 295 
EDGE32 instruction 295 
EDGE32L instruction 295 
EDGE32LN instruction 295 
EDGE32N instruction 295 
EDGE8 instruction 295 
EDGE8L instruction 295 
EDGE8LN instruction 295 
EDGE8N instruction 295 
emulating multiple unsigned condition codes 141 
enable floating-point (FEF) field of FPRS register 94, 
111, 146, 287, 319, 321, 405, 407, 425 
enable floating-point (PEF) field of PSTATE register 94, 
111, 146, 287, 319, 321, 405, 407, 425 
Error Enable Register 
NCEEN field 195 
error state 
and watchdog reset 251 
error state, and watchdog reset 26 
exceptions 
async_data_error 320, 326, 330, 436
clean_window 114, 393, 394
data_access_error 281, 293, 320, 324, 328, 330, 331,
361, 402, 405, 407, 409, 410, 434, 443, 445, 446
data_access_exception 293, 320, 322, 330, 331, 405,
407, 409, 410, 445, 446, 447, 449
data_access_protection 281, 324, 326, 328, 361, 402,
434, 436
division_by_zero 144, 358
fill_n_normal 392, 394
fill_n_other 392, 394
fp_disabled 94, 146, 287, 300, 304, 306, 307, 309,
311, 319, 320, 321, 322, 348, 350, 355, 405, 407,
425, 432, 443
fp_exception_ieee_754 119, 124, 125, 126, 300, 304,
306, 307, 311
fp_exception_other 83, 176, 299, 300, 302, 304, 306,
307, 309, 311, 313, 350
illegal_instruction 76, 105, 176, 285, 290, 295, 317,
320, 355, 357, 379, 386, 387, 389, 395, 405, 407,
417, 419, 434, 435, 436, 443, 444, 445, 446
LDDF_mem_address_not_aligned 137, 319, 321, 322
mem_address_not_aligned 137, 293, 318, 319, 320,
322, 323, 324, 325, 326, 391, 392, 405, 407, 409,
410, 434, 436, 443, 445, 446, 447, 449
privileged_action 92, 138, 293, 321, 322, 325, 326,
331, 389, 390, 407, 410, 435, 436, 446, 449
privileged_opcode 295, 387, 395, 419
spill_n_normal 316, 394
spill_n_other 316, 394
STDF_mem_address_not_aligned 137, 405, 407
tag_overflow 143, 413, 414, 450, 452
trap_instruction 416, 417
window_fill 115, 391
window_spill 115
extended word addressing 180
extended word data format 59
Externally Initiated Reset (XIR) 251


F
F pipeline stage 36
FABSd instruction 308
FABSq instruction 308
FABSs instruction 308
FADD instruction 299
fadd of numbers with opposite signs 119
FADDd instruction 298
FADDq instruction 298
FADDs instruction 298
FALIGNADDR instruction
grouping rules 45
FALIGNDATA instruction 269
grouping rules 45
FAND instruction 332
FANDNOT1 instruction 332
FANDNOT1S instruction 332
FANDNOT2 instruction 333
FANDNOT2S instruction 333
FANDS instruction 332
fast_data_access_MMU_miss exception 110
fast_data_access_protection exception 110
fast_instruction_access_MMU_miss exception 110
FBA instruction 423, 425
FBE instruction 423
FBfcc instructions 93, 118, 146, 423, 425
FBG instruction 423
FBGE instruction 423
FBL instruction 423
FBLE instruction 423
FBLG instruction 423
FBN instruction 423, 424
FBNE instruction 423
FBO instruction 423
FBPA instruction 285, 287
FBPcc instructions 174
FBPE instruction 285
FBPfcc instructions 93, 118, 146, 174, 175, 285, 425
FBPG instruction 285
FBPGE instruction 285
FBPL instruction 285
FBPLE instruction 285
FBPLG instruction 285
FBPN instruction 285, 287
FBPNE instruction 285
FBPO instruction 285
FBPU instruction 285
FBPUE instruction 285
FBPUG instruction 285
FBPUGE instruction 285
FBPUL instruction 285
FBPULE instruction 285
FBU instruction 423
FBUE instruction 423
FBUG instruction 423
FBUGE instruction 423
FBUL instruction 423
FBULE instruction 423
fcc-conditional branches 287, 425
FCMP* instructions 118, 119, 300
FCMPd instruction 300
FCMPE* instructions 118, 119, 300
FCMPEd instruction 300
FCMPEQ instruction 370
FCMPEq instruction 300
FCMPEQ16 instruction 369
FCMPEQ32 instruction 369
FCMPEs instruction 300
FCMPG instruction 370
FCMPGT16 instruction 369
FCMPGT32 instruction 369
FCMPL instruction 370
FCMPLE16 instruction 369
FCMPLE32 instruction 369
FCMPNE instruction 370
FCMPNE16 instruction 369
FCMPNE32 instruction 369
FCMPq instruction 300
FCMPs instruction 300
fcn field of instructions 294
FDIVd instruction 310
FDIVq instruction 310
FDIVs instruction 310
FdMULq instruction 310
FdTOi instruction 302, 304
FdTOq instruction 304
FdTOs instruction 304
fdtos instruction 120
FdTOx instruction 302, 304
FEXPAND instruction 151, 372, 377
FEXPAND instruction, pixel formatting 373
FEXPAND operation 377
FFA (F.p./Graphics ALU) pipeline 37
FFA pipeline 244
FGA pipeline xxxiii, 283
FGM (F.p./Graphics multiply) pipeline 37
FGM pipeline xxxiii, 244
fill register window 78, 154, 393, 395
fill_n_normal exception 392, 394
fill_n_other exception 392, 394
FiTOd instruction 306
FiTOq instruction 306
FiTOs instruction 306, 307
fixed-point scaling 364
floating point
divide/square root 45
grouping rules ??-45
latencies 44
operation statistics 244
register file access 39
store instructions 45
subnormal value generation 119
floating point complex calculations 71
floating-point add and subtract instructions 298
floating-point compare instructions 118, 119, 300, 300
floating-point condition code bits 425
floating-point condition codes (fcc) fields of FSR register
118, 121, 122, 287, 301, 425
floating-point data type 59
floating-point deferred-trap queue (FQ) 385
floating-point exception 120
floating-point move instructions 308
floating-point multiply and divide instructions 310
floating-point operate (FPop) instructions 120, 124, 146,
175, 432
floating-point registers 83
floating-point registers state (FPRS) register 93, 389, 422
floating-point square root instructions 312
floating-point state (FSR) register 117, 124, 125, 127,
405, 432, 442, 443
floating-point trap type (ftt) field of FSR register 125
floating-point trap type (ftt) field of FSR register 117,
120, 124, 147, 405, 443
floating-point trap types
IEEE_754_exception 121, 122, 124, 125, 127
invalid_fp_register 83, 121, 309, 313
numeric values 121
sequence_error 121
unfinished_FPop 121, 122, 127, 299, 311
unimplemented_FPop 121, 127, 300, 302, 304, 306,
307, 311, 348, 350
floating-point traps
precise 387
FLUSH instruction 313
after internal store 196
grouping rule 47
memory ordering control 187
self-modifying code 314
flush register windows instruction 315
flushing
TLB 209
FLUSHW instruction 153, 315
FLUSHW instruction, grouping rule 46
FMOVA instruction 343
FMOVCC instruction 343
FMOVcc instructions 92, 118, 140, 174, 175, 343, 348,
355
grouping rules 48
FMOVCS instruction 343
FMOVd instruction 308
FMOVDcc instruction 345
FMOVE instruction 343
FMOVFA instruction 344
FMOVFE instruction 344
FMOVFG instruction 344
FMOVFGE instruction 344
FMOVFL instruction 344
FMOVFLE instruction 344
FMOVFLG instruction 344
FMOVFN instruction 344
FMOVFNE instruction 344
FMOVFO instruction 344
FMOVFU instruction 344
FMOVFUE instruction 344
FMOVFUG instruction 344
FMOVFUGE instruction 344
FMOVFUL instruction 344
FMOVFULE instruction 344
FMOVG instruction 343
FMOVGE instruction 343
FMOVGU instruction 343
FMOVL instruction 343
FMOVLE instruction 343
FMOVLEU instruction 343
FMOVN instruction 343
FMOVNE instruction 343
FMOVNEG instruction 343
FMOVPOS instruction 343
FMOVq instruction 308
FMOVQcc instruction 345
FMOVr instructions 175, 349
FMOVRGEZ instruction 349
FMOVRGZ instruction 349
FMOVRLEZ instruction 349
FMOVRLZ instruction 349
FMOVRNZ instruction 349
FMOVRZ instruction 349
FMOVs instruction 308
FMOVScc instruction 345
FMOVVC instruction 343
FMOVVS instruction 343
FMUL8SUx16 instruction 363, 366
FMUL8ULx16 instruction 363, 367
FMUL8x16 instruction 152, 363, 364
FMUL8x16AL instruction 363, 365
FMUL8x16AU instruction 363, 365
FMULd instruction 310
FMULD8SUx16 instruction 363, 367
FMULD8ULx16 instruction 363, 368
FMULq instruction 310
FMULs instruction 310
FNAND instruction 332
FNANDS instruction 332
FNEGd instruction 308
FNEGq instruction 308
FNEGs instruction 308
FNOR instruction 332
FNORS instruction 332
FNOT1 instruction 332
FNOT1S instruction 332
FNOT2 instruction 332

FNOT2S instruction 332 

FONE instruction 332 

FONES instruction 332 

FOR instruction 332 

formats, instruction 171 

FORNOT1 instruction 332

FORNOT1S instruction 332

FORNOT2 instruction 332 

FORNOT2S instruction 332 

FORS instruction 332 

fp_disabled exception 94, 96, 146, 287, 300, 304, 306,
307, 309, 311, 319, 320, 321, 322, 348, 350, 355,
402, 405, 407, 425, 432, 443

fp_disabled trap 98

fp_exception exception 124

fp_exception_ieee_754 "invalid" exception 303

fp_exception_ieee_754 exception 97, 119, 124, 125, 126,
300, 304, 306, 307, 311

fp_exception_other exception 83, 97, 119, 122, 147, 176,
299, 300, 302, 304, 306, 307, 309, 311, 313, 350

FPACK instructions 151-??, 372-377

FPACK, performance usage 373 

FPACK16 instruction 151, 372, 373

FPACK16 operation 374 

FPACK32 instruction 372, 375 

FPACK32 operation 375 

FPACKFIX instruction 372, 376 

FPACKFIX operation 377 

FPADD16 instruction 361

FPADD16S instruction 361

FPADD32 instruction 361 

FPADD32S instruction 361 

FPMERGE instruction 372, 378 

FPMERGE instruction, back-to-back execution 373 

FPRS 
.FEF 98 

FPRS register 
description 93 
FEF field 97, 422 

FPSUB16 instruction 361

FPSUB16S instruction 361

FPSUB32 instruction 361 

FPSUB32S instruction 362 

FqTOd instruction 304 

FqTOi instruction 302

FqTOs instruction 304 

FqTOx instruction 302

FsMULd instruction 310
FSQRTd instruction 312
FSQRTq instruction 312
FSQRTs instruction 312
FSR
ftt field 119
nonstandard floating-point operation 119
NS field 119
=1 119
=0 299
=1 299
FSRC1 instruction 332
FSRC1S instruction 332
FSRC2 instruction 332
FSRC2S instruction 332
FsTOd instruction 304
FsTOi instruction 302, 304
FsTOq instruction 304
FsTOx instruction 302, 304
FSUB instruction 299
fsub of numbers with the same signs 120
FSUBd instruction 298
FSUBq instruction 298
FSUBs instruction 298
FXNOR instruction 332
FXNORS instruction 332
FXOR instruction 332
FXORS instruction 332
FxTOd instruction 306, 307
FxTOq instruction 306
FxTOs instruction 306, 307
FZERO instruction 332 
FZEROS instruction 332 


G 


generating constants 397 
global registers 
interrupt 109 
trap 109 
global registers 74, 76, 76 
global visibility 186 
graphics data format 
fixed 16-bit 71 
Graphics Status Register 
format 98 
grouping rules 41-45 
BMASK and BSHUFFLE 283 
SIAM instruction 396
GSR 
fields 
ALIGN 99 
IM (interval mode) field 98 
IRND (rounding) 99 
MASK 98 
SCALE 99 
format 98 
mask, setting before BSHUFFLE 283 
write instruction latency 45 


H 
halfword 
addressing 179 
alignment 137 
data format 59 
hardware 
interlocking mechanism 340 
helper 
cycle 43 
execution order 43 
generation 43 
in pipelines 43 


I 


i field of instructions 175, 268, 313, 315, 317, 319, 321,
323, 325, 329, 330, 336, 353, 356, 358, 379, 389,
391, 429, 432, 433, 435, 437, 439, 440
I pipeline stage 37 
I/D Translation Storage Buffer Register 
differences from UltraSPARC-I 210 
I/O 
access 194, 196 
memory 184 
memory-mapped 185 
noncacheable address 191 
IC_miss 237
IC_miss_cancelled 237


icc field of CCR register 90, 92, 268, 290, 336, 355, 412,
413, 416, 427, 430, 431, 437, 439
icc-conditional branches 427 


IEEE Std 754-1985 xxxiii, 119, 122, 126, 127, 147 
IEEE_754_exception floating-point trap type xxxiii, 121,
122, 124, 127
IER register (SPARC V8) 422
IIU
branch prediction statistics 234 
stall counts 234 
illegal address aliasing 206 
illegal_instruction exception 76, 105, 176, 261, 285, 290, 
295, 317, 320, 355, 357, 379, 386, 387, 389, 395, 
405, 407, 417, 419, 434, 435, 436, 443, 444, 445, 446 
illegal_instruction exception 381 
ILLTRAP instruction 316 
images 
band interleaved 70 
band sequential 70 
imm_asi field of instructions 138, 175, 291, 319, 321, 
323, 325, 329, 330, 432, 433, 435 
imm22 field of instructions 175 
I-MMU 
disabled 195 
Enable bit 129 
and instruction prefetching 195 
implementation 
dependency xxvi 
implementation note xxx 
implementation number (imp) field of VER register 116 
implicit 
ASI 138 
byte order 110 
in registers 74, 78, 392 
inexact accrued (nxa) bit of aexc field of FSR register 127 
inexact current (nxc) bit of cexc field of FSR register 127 
inexact mask (NXM) bit of TEM field of FSR register 124 
inexact quotient 429, 430 
initiated xxxiv 
instruction 
bypass 44 
conditional branch 39 
dependency check 42 
dispatching properties 49 
execution order 42 
explicit synchronization 278 
grouping rules 41-45 
latency 42, 49 
multicycle, blocking 42 
number completed 233 
prefetch 25, 195, 249 
window-saving 46 
with helpers 47 
writing integer register 43 
Instruction Cache 201 
physically indexed, physically tagged 201
instruction cache 
effect of mode change 202 
reference counts 237 
instruction fields 

a 174, 284, 288, 291, 424, 427 

cc0 174, 287, 288, 301, 353 

cc1 174, 287, 288, 301, 353

cc2 174, 353 

cond 174, 287, 288, 345, 353, 424, 427 

const22 316 

d16hi 174, 284 

d16lo 174, 284

definition xxxiv 

disp19 174, 287, 288 

disp22 175, 424, 427 

disp30 175, 290 

fcn 294 

i 175, 268, 313, 315, 317, 319, 321, 323, 325, 329, 
330, 336, 353, 356, 358, 379, 389, 391, 429, 432, 
433, 435, 437, 439, 440 

imm_asi 138, 175, 291, 319, 321, 323, 325, 432, 433,
435 

imm22 175 

mmask 175, 441 

op3 175, 268, 291, 294, 313, 315, 317, 319, 321, 323, 
325, 329, 330, 336, 358, 385, 389, 391, 429, 432, 
433, 435, 437, 439, 440 

opf 175, 299, 301, 303, 305, 306, 308, 310, 312 

opf_cc 175, 345

opf_low 175, 345, 349

p 175, 284, 287, 288 

rcond 175, 284, 349, 356 

rd 175, 268, 291, 299, 303, 305, 306, 308, 310, 312, 
317, 319, 321, 323, 325, 329, 330, 336, 345, 349, 
353, 356, 358, 379, 385, 389, 429, 432, 433, 435, 
437, 439, 440 

reserved 261 

rs1 175, 268, 284, 291, 299, 301, 310, 313, 317, 319,
321, 323, 325, 329, 330, 336, 349, 356, 358, 385, 
389, 391, 429, 432, 433, 435, 437, 439, 440 

rs2 175, 268, 291, 299, 301, 303, 305, 306, 308, 310, 
312, 313, 317, 319, 321, 323, 325, 329, 330, 336, 
345, 349, 353, 356, 358, 379, 391, 429, 432, 433, 
435, 437, 439 

shcnt32 175 

shcnt64 175 

simm10 175, 356 

simm11 175, 353

simm13 175, 268, 313, 317, 319, 321, 323, 324, 329,
330, 336, 358, 379, 390, 429, 432, 433, 435, 437, 
439 

sw_trap# 176 

x 176 

instruction set architecture (ISA) xxxiv 
instruction_access_error exception 25, 249 
instruction_access_exception exception 110 
instructions 

alignment 137, 137, 270 

array addressing 150, 271 

atomic 292 

atomic load-store 291, 292, 329, 330, 446, 448 

block load and store 275 

branch if contents of integer register match condition 
283 

branch on floating-point condition codes 423 

branch on floating-point condition codes with 
prediction 285 

branch on integer condition codes 425 

branch on integer condition codes with prediction 288 

causing illegal instruction 316 

compare and swap 291 

comparison 144, 412 

control-transfer (CTIs) 154, 294 

convert between floating-point formats 304 

convert floating-point to integer 302 

convert integer to floating-point 306 

count of number of bits 379 

divide 357, 428 

DONE 109, 294 

edge handling 151, 296 

floating-point add and subtract 298 

floating-point compare 118, 119, 300, 300 

floating-point move 308 

floating-point multiply and divide 310 

floating-point operate (FPop) 120, 124, 146, 432 

floating-point square root 312 

flush instruction memory 313 

flush register windows 315 

formats 171 

generate software-initiated reset 403 

jump and link 155, 317 

load floating-point 431 

load floating-point from alternate space 320 

load integer 322, 433 

load integer from alternate space 324, 434 

load quadword 327 

load-store unsigned byte 292, 329, 447, 448 


load-store unsigned byte to alternate space 330 

logical 335 

logical operate 334 

move floating-point register if condition is true 343 

move floating-point register if contents of integer 
register satisfy condition 349 

move integer register if contents of integer register 
satisfies condition 356 

multiply 357, 436, 436 

ordering MEMBAR 153 

partial store 360 

partitioned add/subtract 151, 362 

partitioned multiply 364 

permuting bytes specified by GSR.MASK 282 

pixel compare 152, 370 

pixel component distance 371 

pixel formatting (PACK) 151, 372 

prefetch data 379 

read privileged register 385 

read state register 388, 440 

register window management 153 

reserved 176 

reserved fields 261 

RETRY 109, 294 

RETURN vs. RESTORE 391 

sequencing MEMBAR 153 

set high bits of low word 397 

set interval arithmetic mode 396 

setting GSR.MASK field 150, 282 

shift 143, 398 

shift count 399 

short floating-point load/store 401 

shut down to enter power-down mode 402 

software-initiated reset 403 

store 408 

store floating point 404 

store floating-point into alternate space 406, 406 

store integer 408 

store integer into alternate space 410 

subtract 411, 411 

swap r register with alternate space memory 448 

swap r register with memory 446 

tagged addition 413 

tagged arithmetic 143 

tagged subtraction 414 

timing 261 

trap on condition codes 416 

trap on integer condition codes 415 

unimplemented 176

write privileged register 417
writing privileged register 419 
integer register file access 38 
integer unit (IU) 
condition codes 92 
interrupt 
enable (IE) field of PSTATE register 112 
on floating-point instructions 96 
global registers 109 
level 113 
request xxxiv 
trap 217 
vector dispatch 216 
vector dispatch register 219 
vector dispatch status register 220 
vector receive 217 
vector receive register 221 
Interrupt Vector Dispatch Status Register 220 
interrupt_vector exception 97 
interrupt_vector trap 109 
invalid accrued (nva) bit of aexc field of FSR register 126 
invalid current (nvc) bit of cexc field of FSR register 126 
invalid mask (NVM) bit of TEM field of FSR register 124 
invalid_exception exception 303 
invalid_fp_register floating-point trap type 83, 121, 309, 
313 
invalidation 
prefetch cache 381 
issued xxxiv 
ITID field of Interrupt Vector Dispatch register 217 


J 
JMPL instruction 25, 39, 249 
computing target address 155 
description 317 
destination register 93 
does not change CWP 80 
reexecuting trapped instruction 391 
jump and link (JMPL) instruction 155, 317 


K 
kernel code 222 


L 
L2 203 
L2-Cache 203, 207 
L2-cache 184, 205, 207, 277 
latency 
BMASK and BSHUFFLE 283 
floating-point operations 44 
FPADD instruction 362 
partitioned multiply 364 
LD instruction (SPARC V8) 323 
LDD instruction 197, 322, 433 
LDDA instruction 76, 324, 326, 434 
LDDF instruction 137, 318, 431
LDDF_mem_address_not_aligned exception 137, 322
LDDFA instruction 137, 274, 320, 361, 400
LDF instruction 318, 431 
LDFA instruction 320 
LDFSR instruction 47, 118, 120, 121, 197, 431 
LDQF instruction 176, 318, 431 
LDQFA instruction 320 
LDSB instruction 197, 322, 433 
LDSBA instruction 324, 434 
LDSH instruction 197, 322, 433 
LDSHA instruction 324, 434 
LDSTUB instruction 139, 191, 329, 331 
LDSTUBA instruction 329, 330 
LDSW instruction 197, 322, 433 
LDSWA instruction 324, 434 
LDUB instruction 322, 433 
LDUBA instruction 324, 434 
LDUH instruction 322, 433 
LDUHA instruction 324, 434 
LDUW instruction 322, 433 
LDUWA instruction 324, 434 
LDX instruction 322, 433 
LDXA instruction 324, 434 
LDXFSR instruction 117, 118, 120, 121, 197, 318, 431 
level-1 cache 199 
flushing 205 
little-endian 
ordering in partial store instructions 361 
little-endian byte order xxxv, 110, 136 
load floating-point from alternate space instructions 320 
load floating-point instructions 431 
load instructions xxxv 
load instructions, getting data from store queue 197 
load integer from alternate space instructions 324, 434 
load integer instructions 322, 433 
load quadword atomic 326

load recirculation 198

LoadLoad MEMBAR relationship 338 

loads 
from alternate space 92, 138 

load-store alignment 137, 137 

load-store instructions 139 
compare and swap 291 
definition xxxv 
load-store unsigned byte 292, 329, 447, 448 
load-store unsigned byte to alternate space 330 
swap r register with alternate space memory 448 
swap r register with memory 292, 446 

LoadStore MEMBAR relationship 338 

local registers 74, 78, 392 

logical instructions 335 

Lookaside MEMBAR relationship 339 

Low Power 402 

lower registers dirty (DL) field of FPRS register 94 


M 
M pipeline stage 39 
machine state 
after reset 253 
in RED state 253 
mask number (mask) field of VER register 117 
maximum trap levels (MAXTL) field of VER register 
117 
MAXTL 112, 403 
may (keyword) xxxv 
mem_address_not_aligned exception 137, 293, 318, 319,
320, 322, 323, 324, 325, 326, 391, 392, 402, 405, 
407, 409, 410, 434, 436, 443, 445, 446, 447, 449 
MEMBAR 
#LoadLoad 186, 338 
#LoadStore 186, 338 
#LoadStore and block store 278 
#Lookaside 184 
#MemIssue 184, 340
#StoreLoad 338 
and BLD 278 
and BST 278 
for strong ordering 340 
#StoreStore 314, 338 
and BST 278 
code example 186 
#Sync 206 
after BST 278
after internal ASI store 195
BLD and BST 277 
semantics 188 
for strong ordering 340 
instruction 153, 175, 218, 313, 337, 389, 441 
explicit synchronization 186 
grouping rules 47 
memory ordering 187 
side-effect accesses 194 
single group 47 
QUAD_LDD requirement 342 
rules for interlock implementation 339 
UltraSPARC-III specifics 339 
MemIssue MEMBAR relationship 339
memory 
access instructions 139 
cached 184 
current model, indication 184 
global visibility of memory accesses 186 
location 184 
models 
and block operations 278 
ordering and block store 278 
partial store order (PSO) 183, 278 
relaxed memory order (RMO) 278 
strongly ordered 196, 340 
total store order (TSO) 183 
total store order (TSO) 278
ordering 186 
synchronization 187 
memory model (MM) field of PSTATE register 111 
memory-mapped I/O 185 
merge buffer 196 
mispredict signal 39 
mmask field of instructions 175, 441 
MMU 
global registers 109 
mode 
privileged 104 
user 92 
MOVA instruction 351 
MOVCC instruction 351 
MOVcc instructions 92, 118, 140, 174, 175, 348, 355 
grouping rules 48 
MOVCS instruction 351 
move floating-point register if condition is true 343 
move floating-point register if contents of integer register 
satisfy condition 349 
MOVE instruction 351
move integer register if contents of integer register
satisfies condition instructions 356 
MOVFA instruction 352 
MOVFE instruction 352 
MOVFG instruction 352 
MOVFGE instruction 352 
MOVFL instruction 352
MOVFLE instruction 352
MOVFLG instruction 352
MOVFN instruction 352 
MOVFNE instruction 352 
MOVFO instruction 352 
MOVFU instruction 352 
MOVFUE instruction 352 
MOVFUG instruction 352 
MOVFUGE instruction 352 
MOVFUL instruction 352 
MOVFULE instruction 352 
MOVG instruction 351 
MOVGE instruction 351 
MOVGU instruction 351 
MOVL instruction 351 
MOVLE instruction 351 
MOVLEU instruction 351 
MOVN instruction 351 
MOVNE instruction 351 
MOVNEG instruction 351 
MOVPOS instruction 351 
MOVR instructions 
grouping rules 48 
MOVr instructions 175, 356 
MOVRGEZ instruction 356 
MOVRGZ instruction 356 
MOVRLEZ instruction 356 
MOVRLZ instruction 356 
MOVRNZ instruction 356 
MOVRZ instruction 356 
MOVVC instruction 351 
MOVVS instruction 351 
MS pipeline 
description 37 
E-stage bypass 42 
integer instruction execution 39 
and W-stage 40 
multiple unsigned condition codes, emulating 141 
multiply instructions 357, 436, 436 
multiprocessor synchronization instructions 292, 447, 
448 
multiprocessor system 313, 447, 448, 449
MULX instruction 357
must (keyword) xxxv 
mutual exclusion, atomic instructions 191 


N 


NaN (not-a-number) 
converting floating-point to integer 303 
quiet 301 
signalling 119, 301, 305 
negative (N) bit of condition fields of CCR 91 
next program counter (nPC) 93, 105, 177, 294, 359 
noncacheable 
accesses 185 
I/O address 191 
instruction prefetch 25, 195, 249 
store compression 196 
store merging enable 129 
nonfaulting 
ASIs and atomic accesses 192 
load 
and TLB miss 192 
behavior 192 
use by optimizer 192 
nonfaulting load xxxvi 
nonleaf routine 318 
nonprivileged 
mode xxxi, 121 
software 93 
nonprivileged trap (NPT) field of TICK register 389 
nonstandard floating-point operation 119 
NOP instruction 287, 358, 416, 424, 427 
note 
implementation xxx 
programming xxx 
nPC register, See next program counter (nPC) 
NS field of FSR 119 
Nucleus code 222 
NWINDOWS 78, 78, 393 


O 

op3 field of instructions 175, 268, 291, 294, 313, 315, 
317, 319, 321, 323, 325, 329, 330, 336, 358, 385, 
389, 391, 429, 432, 433, 435, 437, 439, 440 

opcode 
definition xxxvi

opf field of instructions 175, 299, 301, 303, 305, 306,
308, 310, 312 

opf_cc field of instructions 175, 345 

opf_low field of instructions 175, 345, 349 

OR instruction 335 

ORcc instruction 335 

ordering 
block load 278 
block store 278 

ordering MEMBAR instructions 153 

ORN instruction 335 

ORNcc instruction 335 

other windows (OTHERWIN) register 114, 315, 385,
393, 418

out register #7 76 

out registers 78, 392 

overflow (V) bit of condition fields of CCR 91, 143

overflow accrued (ofa) bit of aexc field of FSR register 126

overflow current (ofc) bit of cexc field of FSR register 126

overflow mask (OFM) bit of TEM field of FSR register 124


P 
p field of instructions 175, 284, 287, 288 
PA Data Watchpoint Register 
DB_PA field 133 
format 133 
PA_watchpoint exception 132 
packed-to-planar conversion 151, 378 
partial store instruction 45 
partial store instructions 359 
partitioned multiply instructions 364 
PC register, See program counter (PC) 
PC, Instr_cnt 233 
PC_1st_rd 239
PC_2nd_rd 239
PC_counter_inv 239
PC_hard_hit 239
PC_MS_misses 239
PC_soft_hit 239
PCR 
access 228 
fields 
PRIV 229 
ST (system trace enable) field 229
SU (select upper bits of PIC) field 229
UT (user trace enable) field 229
function
Cycle_cnt 233
DC_hit 238
Dispatch0_2nd_br 235
Dispatch0_br_target 235
Dispatch0_IC_miss 234
Dispatch0_mispred 235
EC_ref 239
EC_snoop_inv 240
EC_snoop_wb 240
EC_wb 240
EC_write_hit_clean 240
IC_ref 237
SI_snoops 243
PRIV field 228 
ST field 228, 233 
UT field 228, 233 
PDIST instruction 371 
PDIST, instruction latency 45 
performance hints 
FPACK usage 373 
FPADD usage 362 
logical operate instructions 334 
partitioned multiply usage 364 
physical address 
data watchpoint 133 
Physical Indexed Caches 201 
Physical Tagged Caches 201 
physical-indexed 
physical-tagged (PIPT) cache 203 
PIC register 
and PCR 228 
access 228 
PIC0 Events 244
PIC1 Events 244 
PICL field 230 
SL selection bit field encoding 244 
pipeline 
A0 37, 38
A1 37
BR 37 
conditional moves 48 
dependencies 38 
FFA 37, 244 
FGA xxxiii, 283
FGM xxxiii, 37, 244 
MS 37, 39, 40 
stages
A 36, 39
B 37
C 39, 40
D 40, 234
E 38
F 36
I 37
M 39
mnemonics 32
R 38, 236
T 40
W 40
stalls, causes 234 
pixel instructions 
comparison 152, 370 
component distance 371 
formatting 151, 372 
planar-to-packed conversion 378 
POK pin 250 
POPC instruction 176, 378 
power-on reset (POR) 102, 103 
system reset when Reset pin activated 26 
Power-On-Reset (POR) 250 
precise floating-point traps 387 
predict bit 284 
prefetch 
instruction, noncacheable 25, 249 
instructions 195 
noncacheable data 381 
Prefetch Cache 
physically indexed 
physically tagged 202 
prefetch cache 
invalidation 381 
valid bits 25, 250 
prefetch data instruction 379 
PREFETCH instruction 160, 379 
descriptions 193 
types 381 
PREFETCHA instruction 379 
priority 
VA vs. PA_watchpoint 132 
privileged 
mode 104 
registers 104 
software 78, 111, 120, 138, 315
privileged (PRIV) field of PSTATE register 112, 293, 
321, 331, 389, 407, 410, 446, 449 
privileged mode (PRIV) field of PSTATE register 112 


privileged registers 46 
privileged_action exception 92, 138, 219, 220, 221, 222, 
293, 321, 322, 325, 326, 331, 389, 390, 407, 410, 
435, 436, 446, 449 
privileged_action exception 184, 191, 228, 230
PIC access 229
privileged_opcode exception 222, 295, 387, 395, 419
privileged_opcode exception 228
processor interrupt level (PIL) register 113, 223, 385, 418 
processor pipeline 
address stage 36 
branch target computation stage 37 
cache stage 39 
done stage 40 
execute stage 38 
fetch stage 36 
instruction issue 37 
register stage 38 
trap stage 40 
processor state (PSTATE) register 77, 105, 107, 110, 294, 
385, 418 
program counter (PC) 93, 104, 177, 291, 294, 317, 359 
programming note xxx 
PSO memory model 183, 186, 187, 194 
PSR register (SPARC V8) 422 
PSTATE 
.PEF 98 
AM field 112 
global register selection encodings 108 
IE field 97, 223 
IG field 108, 109, 218 
MG field 108, 109 
MM field 184 
PEF field 422 
PRIV field xxxvi, xxxvii, 184, 191 
RED field 96 
exiting RED_state 25, 195, 249
register 109 
WRPR instruction and BST 278 


Q 
Quad FPop instructions 176 
quad load instruction 197, 342 
quadword 

addressing 180 

alignment 137 

data format 59

definition xxxvii
quiet NaN (not-a-number) 119, 301 


R 

R pipeline stage 38 

r register 
#15 76 
categories 75 
special-purpose 76 
alignment 434, 435 

rational quotient 430 

R-A-W 
Bypass Enable bit in DCUCR 129 
bypassing algorithm 197 
bypassing data from store queue 129 
detection algorithm 198 

rcond field of instructions 175, 284, 349, 356 

rd field of instructions 175, 268, 291, 299, 303, 305, 306, 
308, 310, 312, 317, 319, 321, 323, 325, 329, 330, 
336, 345, 349, 353, 356, 358, 379, 385, 389, 429, 
432, 433, 435, 437, 439, 440 

RDASI instruction 388, 388, 440 

RDASR 
format 98 

RDASR instruction 94, 228, 388, 388, 440, 441 
dispatching 46 
forcing bubbles before 46 

RDCCR instruction 50, 388, 388, 440 

RDDCR instruction 388 

RDFPRS instruction 388, 388, 440 

RDGSR instruction 388 

RDPC instruction 93, 388, 388, 440 

RDPIC instruction 229, 388 

RDPR FQ instruction 176 

RDPR instruction 104, 108, 113, 116, 385, 390 
dispatching 46 
forcing bubbles before 46 

RDSOFTINT instruction 388 

RDSTICK instruction 388 

RDSTICK_CMPR instruction 388

RDTICK instruction 388, 388, 390, 440 

RDTICK_CMPR instruction 388

RDY instruction 90 

Re_DC_miss counter 236

Re_EC_miss counter 237

Re_FPU_bypass counter 236

Re_PC_miss counter 237

Re_RAW_miss counter 236
read privileged register (RDPR) instruction 385 
read state register instructions 388, 440 
real memory 184 
recirculation instrumentation 236 
RED_state 249
exiting 195
trap vector 27, 252
RED_state (RED) field of PSTATE register 110
register 
access 
floating-point 39 
integer 38 
Floating-Point Status (FSR) 119 
global trap 109 
PSTATE 109 
register window management instructions 153 
register windows 78 
clean 115 
fill 78, 154, 393, 395 
spill 78, 154, 393, 395 
registers 
address space identifier (ASI) 294, 321, 325, 331, 
380, 407, 410, 422, 435, 446, 448 
alternate global 76 
ancillary state registers (ASRs) 90, 94 
ASI 92, 105 
CANRESTORE 114 
CANSAVE 114 
clean windows (CLEANWIN) 114, 114, 385, 418 
CLEAR_SOFTINT 223
condition codes register (CCR) 105, 268, 294, 422, 
439 
control and status 90 
current window pointer (CWP) 78, 105, 114, 114, 
115, 294, 315, 385, 393, 418 
Data Cache Unit Control (DCUCR) 128 
dispatch control register (DCR) 95 
floating-point 83 
floating-point registers state (FPRS) 93, 389, 422 
floating-point state (FSR) 117, 124, 125, 127, 432, 
442 
global 74, 76, 76 
IER (SPARC V8) 422 
in 74, 78, 392 
Interrupt Vector Dispatch register 219 
Interrupt Vector Dispatch Status register 220 
Interrupt Vector Receive register 221 
local 74, 78, 392
other windows (OTHERWIN) 114, 315, 385, 393,
418 
out 78, 392 
out #7 76 
PC 93 
performance control (PCR) 228 
privileged 104 
processor interrupt level (PIL) 113, 385, 418 
processor state (PSTATE) 77, 105, 107, 110, 294, 
385, 418 
PSR (SPARC V8) 422 
r15 
r register #15 76 
restorable windows (CANRESTORE) 78, 114, 115, 
385, 393, 395, 418 
savable windows (CANSAVE) 78, 114, 114, 315, 
385, 393, 395, 418 
SET_SOFTINT 223 
SOFTINT 222 
TBR (SPARC V8) 422 
TICK 102, 103, 385, 418 
TICK_COMPARE 103 
trap base address (TBA) 107, 385, 418 
trap level (TL) 104, 107, 112, 112, 115, 117, 294, 
385, 386, 395, 403, 418, 419 
trap next program counter (TNPC) 105, 385, 418 
trap program counter (TPC) 385, 387, 418 
trap state (TSTATE) 105, 109, 294, 385, 418 
trap type (TT) 105, 107, 115, 385, 416, 418 
version register (VER) 116, 385 
WIM (SPARC V8) 422 
window state (WSTATE) 113, 115, 315, 385, 393, 
418 
Y 90, 90, 429, 437, 439, 453 
reserved 
fields in instructions 261 
instructions 176 
reset 
power-on 102, 103 
reset trap 102, 103 
system 26 
restorable windows (CANRESTORE) register 78, 114, 
115, 385, 393, 395, 418 
RESTORE instruction 392-394
actions 154 
and current window 79 
decrementing CWP register 78 
followed by SAVE instruction 80 
managing register windows 153
operation 392 
performance trade-off 393 
and restorable windows (CANRESTORE) register 
114 
restoring register window 393 
SPARC V9 vs. SPARC V8 115 
RESTORED instruction 154, 394, 394, 394 
use by privileged software 153 
RESTORED instruction, single group 46 
restricted address space identifier 138 
restricted ASI 184 
RETRY instruction 92, 97, 109, 155, 294 
after internal store to ASI 196 
and BST 278 
exiting RED_state 25, 249 
grouping rules 47 
restoring AG, IG, MG bits 109 
use with IFPOE 97 
when TSTATE uninitialized 25, 250 
RETURN instruction 39, 390-392
computing target address 155 
destination register 93 
operation 390 
reexecuting trapped instruction 391 
RMO memory model 183, 186, 187, 194, 278 
rounding 
behavior in GSR 98 
for floating-point results 119 
in signed division 430 
rounding direction (RD) field of FSR register 119, 299, 
303, 305, 307, 311, 312 
routine, nonleaf 318 
rs1 field of instructions 175, 268, 284, 291, 299, 301,
310, 313, 317, 319, 321, 323, 325, 329, 330, 336, 
349, 356, 358, 385, 389, 391, 429, 432, 433, 435, 
437, 439, 440 
rs2 field of instructions 175, 268, 291, 299, 301, 303, 
305, 306, 308, 310, 312, 313, 317, 319, 321, 323, 
325, 336, 345, 349, 353, 356, 358, 379, 429, 432, 
433, 435, 437, 439 
R-stage stall counts 236 
Rstall_FP_use counter 236
Rstall_IU_use counter 236
Rstall_storeQ counter 236 
RSTVaddr 27, 252


S
savable windows (CANSAVE) register 78, 114, 114, 315,
385, 393, 395, 418
SAVE instruction 392-394 
actions 154 
after RESTORE instruction 391 
and current window 79 
decrementing CWP register 78 
leaf procedure 318 
and local/out registers of register window 80 
managing register windows 153 
no clean window available 115 
number of usable windows 114 
operation 392 
performance trade-off 393 
and savable windows (CANSAVE) register 114 
SPARC V9 vs. SPARC V8 115 
SAVED instruction 153, 154, 394, 394, 394 
SAVED instruction, single group 46 
Scalable Processor Architecture see SPARC 
scaling of the coefficient 364 
SDIV instruction 90, 428 
SDIVcc instruction 90, 428 
SDIVX instruction 357 
self-modifying code 314 
sequence_error floating-point trap type 121 
sequencing MEMBAR instructions 153 
SET_SOFTINT pseudo-register 223 
SETCC instruction, grouping 43 
SETHI instruction 143, 144, 175, 359, 397, 397 
SFSR 
FT field 
FT = 10 192 
FT = 2 185, 192, 194 
FT = 4 191
FT = 8 191, 192
shall (keyword) xxxviii 
shcnt32 field of instructions 175 
shcnt64 field of instructions 175 
shift count encodings 399 
shift instructions 143, 144, 398 
short floating-point load and store instructions 400 
short floating-point load instruction 197 
should (keyword) xxxix 
SHUTDOWN instruction 402 
SIAM instruction 395 
grouping rules 45 
rounding 396 
setting GSR fields 396 


side effect 
accesses 185, 194 
and block load 278 
instruction placement 195 
instruction prefetching 195 
visible 185 
signalling NaN (not-a-number) 119, 301, 305 
signed integer data type 59 
sign-extended 64-bit constant 175 
simm10 field of instructions 175, 356 
simm11 field of instructions 175, 353
simm13 field of instructions 175, 268, 313, 317, 319, 
321, 323, 325, 329, 330, 336, 358, 379, 391, 429, 
432, 433, 435, 437, 439 
single-instruction group 42, 43, 46, 47, 50 
SIR instruction 26, 251, 403, 421 
grouping rule 47 
SLL instruction 398, 398 
SLLX instruction 398, 398 
SMUL instruction 90, 436 
SMULcc instruction 90, 436 
snooping 
snoop counts 243 
SOFTINT register 222 
software interrupt (SOFTINT) register 
clearing 223 
in code sequence for Interrupt Receive 218 
scheduling interrupt vectors 222 
setting 223 
software statistics, counters 243 
software trap 416 
software_initiated_reset (SIR) 26, 403
Software-Initiated Reset (SIR) 47, 251 
SPARC xxv 
Architecture Manual, Version 9 xxv 
brief history xxv 
International, address of xxvi 
V9, architecture xxv 
SPARC V8 compatibility 
ADDC/ADDCcc renamed 269 
current window pointer (CWP) register differences 
115 
delay instruction 155 
delay instruction fetch 158 
executing delayed conditional branch 158 
existing nonprivileged SPARC V8 software 77 
instruction between FBfcc/FBPfcc 287
LD, LDUW instructions 323 
level 15 interrupt 113
read state register instructions 390
STA instruction renamed 410 
STBAR instruction 339, 441 
STD instruction 444 
STDA instruction 446 
STFSR instruction 443 
tagged add instructions 450 
tagged subtract instructions 452 
Ticc instruction 417 
UNIMP instruction renamed 316 
write state register instructions 422 
SPARC V9 
compliance xxxvi 
speculative load 185 
spill register window 78, 154, 393, 395 
spill windows 393 
spill_n_normal exception 316, 394
spill_n_other exception 316, 394
SRA instruction 398, 398 
SRAX instruction 398, 398 
SRL instruction 398, 398 
SRLX instruction 398, 398 
stable storage 206 
stack frame 393 
stalls 
counted 234 
pipeline 234 
R Stage counts 236 
STB instruction 408 
STBA instruction 409 
STBAR instruction 187, 339, 389 
STDA instruction 76 
STDF instruction 137, 404 
STDF_mem_address_not_aligned exception 137, 405,
407
STDFA instruction 137, 274, 359, 400, 406, 406 
STF instruction 404 
STFA instruction 406 
STFSR instruction 117, 118, 120 
STH instruction 408 
STHA instruction 409 
STICK register 388 
STICK_COMPARE register 103, 388
STICK_INT 223
store 
buffer 
merging 194 
compression 185, 196 
instructions, giving data to a load 197
noncacheable, coalescing 196
queue
R-stage stall count 236
store floating-point into alternate space instructions 406 
store instructions xxxix 
StoreLoad MEMBAR relationship 338 
stores to alternate space 92, 138 
StoreStore MEMBAR relationship 338 
STQF instruction 176, 404
STQFA instruction 406, 406
strongly ordered memory model 196, 340 
STW instruction 408 
STWA instruction 409 
STX instruction 408 
STXA instruction 409 
STXFSR instruction 117, 118, 120, 404 
SUB instruction 411, 411 
SUBC instruction 411, 411 
SUBcc instruction 144, 411, 411 
SUBCcc instruction 411, 411 
subtract instructions 411 
supervisor software 77, 121, 138 
SW_count_0 243 
SW_count_1 243
sw_trap# field of instructions 176 
SWAP instruction 191, 329, 331, 446 
swap r register with alternate space memory instructions
448
swap r register with memory instructions 292, 446 
SWAPA instruction 329, 331, 448 
Sync MEMBAR relationship 338 
Synchronous Fault Status Registers (SFSR) Extensions Differences From
UltraSPARC-I 210
system interface 
statistics, counters 243 
system interface unit (SIU) instructions 39 
system software 314 
system timer interrupt, STICK_INT 223


T 

T pipeline stage 40
TA instruction 415
TADDcc instruction 143, 412
TADDccTV instruction 143
tag overflow 143
tag_overflow exception 143, 413, 414, 450, 452
tagged arithmetic instructions 143
tagged word data format 59 
tagged words 59 
TBR register (SPARC V8) 422 
TCC instruction 415 
Tcc instructions 92, 174, 176, 415 
TCS instruction 415 
TE instruction 415 
TG instruction 415 
TGE instruction 415 
TGU instruction 415 
Ticc instruction (SPARC V8) 417 
TICK_CMPR.INT_DIS field 222
TICK_COMPARE register 103 
TICK_INT 223 
timer interrupt, TICK_INT 223 
timing of instructions 261 
TL instruction 415 
TL register 419 
TLB 
and 3-dimensional arrays 273 
data access 39 
Data Access Register 210 
Diagnostic Register 211 
flushing 209 
hit xxxix 
miss and nonfaulting load 192 
miss counts 237 
TLE instruction 415 
TLEU instruction 415 
TN instruction 415 
TNE instruction 415 
TNEG instruction 415
total store order (TSO) memory model 111
TPOS instruction 415
trap
atomic accesses 191
atomic instructions 191
fp_disabled
GSR access 422
fp_disabled 96
fp_exception_ieee_754 97
fp_exception_other 97, 119
level 112
noncacheable accesses 185
stack 108
VA/PA watchpoint 132
trap base address (TBA) register 107, 385, 418 
trap enable mask (TEM) field of FSR register 123, 124 
trap globals 109 
trap handler 295 
user 121 
trap level (TL) register 104, 107, 112, 112, 115, 117, 
294, 385, 386, 395, 403, 418, 419 
trap next program counter (TNPC) register 105, 385, 418 
trap on integer condition codes instructions 415 
trap program counter (TPC) register 385, 387, 418 
trap state (TSTATE) register 105, 109, 294, 385, 418 
trap type (TT) register 105, 107, 115, 385, 416, 418 
trap_instruction (ISA) exception 416, 417
trap_little_endian (TLE) field of PSTATE register 110,
110
traps 
software 416 
TSO memory model 183, 184, 185, 186, 187, 194 
TSTATE register 
initializing 25, 250 
PEF field 97 
TSUBcc instruction 143, 413
TSUBccTV instruction 143 
TTE 
CP (cacheability) field 185, 191 
CV (cacheability) field 185, 191 
E field 184, 185, 186, 192, 194 
format 210 
NFO field 192 
TVC instruction 415 
TVS instruction 415 


U
UART 185
UDIV instruction 90, 428
UDIVcc instruction 90, 428
UDIVX instruction 357
UltraSPARC-I 339
UltraSPARC-II 339
UMUL instruction 90, 436
UMULcc instruction 90, 436

unconditional branches 287, 289, 424, 427 

underflow accrued (ufa) bit of aexc field of FSR register
127
underflow current (ufc) bit of cexc field of FSR register
127
underflow mask (UFM) bit of TEM field of FSR register
124, 127
unfinished_FPop exception 119
unfinished_FPop exception 304, 305, 307
unfinished_FPop floating-point trap type 121, 122, 127,
311
UNIMP instruction (SPARC V8) 316
unimplemented instructions 176
unimplemented_FPop floating-point trap type 121, 123,
127, 300, 302, 304, 306, 307, 311, 348, 350
unsigned integer data type 59 


upper registers dirty (DU) field of FPRS register 94 


user 
mode 92 
trap handler 121 


V 
VA Data Watchpoint Register 
DB_VA field 132
VA_watchpoint exception 132
version register (VER) 116, 385 
virtual address 184 
data watchpoint 132 
virtual address 0 192 
Virtual Indexed, Physical Tagged Caches 199 
virtual-indexed physical-tagged (VIPT) cache 199
virtual-to-physical address translation 184 
VIS instruction execution 39 
Visual Instruction Set (VIS) 97 


W 
W pipeline stage 40 
watchdog_reset (WDR) 26, 251 
watchpoints
data registers 132
WC_miss 238
WC_scrubbed 238
WC_snoop_cb 238
WC_wb_wo_read 238
WIM register (SPARC V8) 422 
window changing 46 
window fill trap handler 153 
window overflow 78
window spill trap handler 153
window state (WSTATE) register 
description 115 
overview 113 
reading WSTATE with RDPR instruction 385 
spill exception 315 
spill trap 393 
writing WSTATE with WRPR instruction 418 
window underflow 78 
window, clean 392 
window_fill exception 115, 391
window_spill exception 115
word 
addressing 180 
alignment 137 
data format 59 
Working Register File (WRF) 46 
WRASI instruction 420 
WRASR 
format 98 
WRASR instruction 94, 228, 420 
forcing bubbles after 46 
grouping rule 46 
WRDCR instruction 420 
WRGSR instruction 420 
WRPCR instruction 420 
WRPIC instruction 420 
WRSOFTINT instruction 420 
WRSOFTINT_CLR instruction 420 
WRSOFTINT_SET instruction 420
WRSTICK instruction 420 
WRSTICK_CMPR instruction 420 
WRTICK_CMPR instruction 420
WRCCR instruction 92, 420 
WRF (Working Register File) 46 
WRFPRS instruction 420 
WRGSR instruction 45 
WRIER instruction (SPARC V8) 422 
Write Cache 203 
write cache 
miss counts 238 
write privileged register instruction 417 
WRPIC instruction 229 
WRPR instruction 102, 108, 113, 417, 417 
forcing bubbles after 46 
grouping rule 46 
to PSTATE and BST 278 
WRPSR instruction (SPARC V8) 422 
WRTBR instruction (SPARC V8) 422
WRWIM instruction (SPARC V8) 422
WRY instruction 90, 420 


X 

x field of instructions 176
xcc field of CCR register 92, 268, 290, 336, 355, 412,
413, 430, 431, 437, 439
XNOR instruction 335
XNORcc instruction 335
XOR instruction 335
XORcc instruction 335


Y 
Y register 90, 90, 429, 437, 439, 453 


Z 
zero (Z) bit of condition fields of CCR 91
zero virtual address 192